Fine-tuned our first 2B medical VLM on a single MacBook M4; it beats Google's MedGemma 4B on the MedXpertQA-MM eval dataset

Hey everyone,

I’ve been experimenting with fine-tuning smaller Vision-Language Models for specialized domains. I wanted to see how far I could push a 2B parameter model on medical reasoning tasks using targeted data.

Model: madrisight/MadriMed-VL-2B · Hugging Face

The benchmark performance vs. google/medgemma-4b-it

MedXpertQA-MM: 21.05 (vs. 18.8), outperforming a model twice its size on complex expertise tasks!

Slake: 65.7 (vs. 72.3)

VQA RAD: 43.09 (vs. 49.9)

I'm also sharing the evaluation notebook: https://github.com/krrish-v/all_huggingface/blob/main/model_evaluation/MadriMed-VL-2B_evaluation.ipynb

Happy to discuss methodology. I'm especially curious whether anyone has thoughts on why the MedXpertQA generalization held up despite the SLAKE/VQA-RAD gap.

It appears that the performance of the multimodal component may not have been evaluated:


Methodology review: MadriMed-VL-2B vs MedGemma 4B on MedXpertQA-MM, SLAKE, and VQA-RAD

This is a promising project, but I would be careful with the current headline.

The compact-medical-VLM direction is genuinely interesting: fine-tuning a 2B vision-language model locally, making it usable on consumer hardware, and testing it on medical VQA / medical reasoning tasks is a worthwhile experiment. The useful contribution could be: a reproducible small-model medical VLM workflow that runs locally and is evaluated honestly.

But the current benchmark claim:

MadriMed-VL-2B beats Google’s MedGemma 4B on MedXpertQA-MM.

is not methodologically safe yet.

The short version is:

The reported 21.05% MedXpertQA-MM score appears to be a text-only / zero-image multiple-choice run, not a confirmed multimodal MedXpertQA-MM score.

That does not mean the model or project is bad. It means the benchmark result needs to be reclassified and rerun with stricter multimodal validation.


Why I would not treat the current MedXpertQA-MM score as a confirmed multimodal result

The key issue is an image-field mismatch.

The MedXpertQA-MM evaluation loop appears to read:

image_list = row.get("image", [])

But MedXpertQA-MM uses an images field for the multimodal image filenames, not a singular image field.

This matters because row.get("image", []) does not fail loudly. It silently returns an empty list when the field is missing. Then the generation function can continue as a text-only run.

That is exactly what the printed result suggests:

MedXpertQA-MM Results (n=2000)
Accuracy      : 21.05% (421/2000)
Unknown       : 0 (0.0%)
Random Baseline: 20.00%

Per-Image-Count Breakdown:
0 image(s): 21.1% (421/2000)

So all 2,000 MedXpertQA-MM examples appear to have been evaluated with 0 loaded images.

That changes the interpretation completely.

Instead of:

MedXpertQA-MM multimodal score: 21.05%.

I would report it as:

MedXpertQA-MM text-only / zero-image baseline: 21.05%.
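The silent fallback described above can be reproduced in a few lines. This is a minimal sketch: the row is a made-up stand-in shaped like a MedXpertQA-MM record, and its values are illustrative rather than taken from the dataset.

```python
# Hypothetical row shaped like a MedXpertQA-MM record; values are illustrative only.
row = {
    "id": "MM-0",
    "question": "Which finding is shown?",
    "images": ["MM-0-a.jpeg"],
}

# Lenient lookup: the key "image" does not exist, so the default [] comes back
# and nothing signals that the schema was misread.
lenient = row.get("image", [])

# Strict lookup: indexing the real field returns the filenames.
strict = row["images"]

# Indexing the wrong field name raises KeyError instead of continuing quietly.
try:
    row["image"]
    raised = False
except KeyError:
    raised = True

print("lenient:", lenient, "| strict:", strict, "| KeyError raised:", raised)
```

With the lenient lookup, every row ends up with an empty image list and the run still completes, which is consistent with the "0 image(s)" line in the printed breakdown.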



Why this matters statistically

MedXpertQA-MM has 5 answer choices per question. A random baseline is therefore about:

20%

The reported score is:

421 / 2000 = 21.05%

That is only 1.05 percentage points above random.

A rough 95% binomial confidence interval for 421 correct out of 2,000 is approximately:

19.3% to 22.8%

That interval includes 20%. So even before the image-loading issue, 21.05% is not strong evidence of meaningful MedXpertQA-MM generalization.
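The interval above can be checked by hand. The sketch below uses the normal approximation to the binomial, which is one common choice for a "rough" interval; the review does not say which method it used, so treat the method as an assumption.

```python
import math

correct, total = 421, 2000   # counts from the printed result
p_hat = correct / total      # observed accuracy
z = 1.96                     # normal quantile for a two-sided 95% interval

# Standard error of a binomial proportion under the normal approximation.
se = math.sqrt(p_hat * (1 - p_hat) / total)
lower, upper = p_hat - z * se, p_hat + z * se

print(f"accuracy={p_hat:.4f}, 95% CI=({lower:.3f}, {upper:.3f})")
print("contains the 20% random baseline:", lower <= 0.20 <= upper)
```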

With the image-loading issue, the safer interpretation is:

The current result is a near-random text-only MCQ baseline, not evidence that the model beat MedGemma on multimodal MedXpertQA-MM.

This is also why I would avoid saying “beats MedGemma” until the corrected multimodal evaluation is rerun.



Why the SLAKE / VQA-RAD gap is still meaningful

The reported SLAKE and VQA-RAD numbers are below MedGemma:

| Benchmark | MadriMed-VL-2B | MedGemma 4B | Gap |
| --- | --- | --- | --- |
| SLAKE | 65.7 | 72.3 | -6.6 |
| VQA-RAD | 43.09 | 49.9 | -6.81 |
| MedXpertQA-MM | 21.05 | 18.8 | +2.25 |

The SLAKE / VQA-RAD gap is plausible and informative.

Those datasets stress short-answer medical visual grounding:

  • yes/no calibration;
  • modality recognition;
  • anatomy / organ recognition;
  • abnormality recognition;
  • concise answer formatting;
  • answer normalization;
  • tokenized F1 or exact-match scoring.

A model can become more medically fluent or better at multiple-choice answer selection while still being weaker on short-answer visual grounding.


But the MedXpertQA result probably did not “hold up despite the SLAKE/VQA-RAD gap” yet. The more likely explanation is simpler:

SLAKE and VQA-RAD measured image-question answering, while the current MedXpertQA-MM run likely did not pass images at all.

So I would not explain the MedXpertQA number as successful multimodal generalization yet. I would first fix and rerun the evaluation.


Is the current methodology good?

For an exploratory notebook: yes, it is useful.

For a public benchmark claim against MedGemma: not yet.

The main methodological problems are:

| Issue | Why it matters |
| --- | --- |
| image vs images mismatch | Converts MedXpertQA-MM into a zero-image run |
| Silent fallback to [] | Hides dataset-schema bugs |
| Printed result shows 0 image(s) for all examples | Confirms the multimodal path probably failed |
| 21.05% is near the 20% random baseline | Weak statistical evidence |
| Local score compared to MedGemma model-card score | Not a same-harness comparison |
| Different benchmark metrics | SLAKE/VQA-RAD are often tokenized-F1 style; MedXpertQA is accuracy |
| No text-only / shuffled-image ablations | Cannot prove image use |
| No confidence intervals / paired test | Cannot judge whether a small delta is meaningful |
| No per-slice analysis | Cannot identify what actually improved |
| No leakage checks described | Medical VQA datasets are small and reused |

The strongest version of the work would fix these issues and then report results more conservatively.


What I would do next

1. Reclassify the current 21.05% result

I would change the result table from this:

| Benchmark | MadriMed-VL-2B | MedGemma 4B |
| --- | --- | --- |
| MedXpertQA-MM | 21.05 | 18.8 |

to this:

| Benchmark | Mode | MadriMed-VL-2B | Interpretation |
| --- | --- | --- | --- |
| MedXpertQA-MM | Text-only / zero-image run | 21.05 | Near-random 5-way MCQ baseline |
| MedXpertQA-MM | Correct image + text | pending | True multimodal rerun |
| MedXpertQA-MM | Shuffled image + text | pending | Image-use control |
| MedXpertQA-MM | Options only | pending | Choice-prior control |
| SLAKE | Image + text | 65.7 | Short-answer medical VQA |
| VQA-RAD | Image + text | 43.09 | Radiology VQA |

That framing is much safer.


2. Make image loading strict

For MedXpertQA-MM, missing images should be a hard error.

Do not use:

image_list = row.get("image", [])

Use something stricter:

from pathlib import Path
from PIL import Image

def load_medxpertqa_mm_example(row, image_root: Path):
    required = ["id", "question", "options", "label", "images"]
    missing = [k for k in required if k not in row]
    if missing:
        raise KeyError(f"Missing required fields: {missing}. Available keys: {list(row.keys())}")

    image_list = row["images"]

    if isinstance(image_list, str):
        image_list = [image_list]

    if not isinstance(image_list, list):
        raise TypeError(f"Expected row['images'] to be a list, got {type(image_list)}")

    if len(image_list) == 0:
        raise ValueError(f"MedXpertQA-MM row has no image filenames: id={row['id']}")

    images = []
    for filename in image_list:
        path = image_root / filename
        if not path.exists():
            raise FileNotFoundError(f"Missing image file: {path}")

        images.append(Image.open(path).convert("RGB"))

    if len(images) != len(image_list):
        raise RuntimeError(
            f"Image count mismatch for id={row['id']}: "
            f"filenames={len(image_list)}, loaded={len(images)}"
        )

    return {
        "id": row["id"],
        "question": row["question"],
        "options": row["options"],
        "label": row["label"].strip().upper(),
        "image_filenames": image_list,
        "images": images,
        "medical_task": row.get("medical_task"),
        "body_system": row.get("body_system"),
        "question_type": row.get("question_type"),
    }

Before running inference, print image telemetry:

from collections import Counter

filename_counts = Counter()
loaded_counts = Counter()
bad_rows = []

for row in dataset:
    try:
        ex = load_medxpertqa_mm_example(row, IMAGE_DIR)
        filename_counts[len(ex["image_filenames"])] += 1
        loaded_counts[len(ex["images"])] += 1
    except Exception as e:
        bad_rows.append({"id": row.get("id"), "error": repr(e)})

print("Rows:", len(dataset))
print("Image filename counts:", filename_counts)
print("Loaded image counts:", loaded_counts)
print("Bad rows:", len(bad_rows))

assert len(bad_rows) == 0
assert loaded_counts[0] == 0
assert sum(k * v for k, v in loaded_counts.items()) > 0

Every multimodal result should include something like:

| Run | Rows | Rows with image filenames | Rows with loaded images | Zero-image rows |
| --- | --- | --- | --- | --- |
| MedXpertQA-MM image+text | 2000 | 2000 | 2000 | 0 |

If Zero-image rows = 2000, it is not a multimodal evaluation.


3. Add required MedXpertQA-MM ablations

For MedXpertQA-MM, one score is not enough.

I would run at least four modes:

| Mode | Images | Text | Purpose |
| --- | --- | --- | --- |
| Text only | none | question + choices | Measures vignette / option-prior reasoning |
| Correct image + text | correct images | question + choices | Intended multimodal evaluation |
| Shuffled image + text | wrong images | question + choices | Tests whether correct images matter |
| Options only | none | choices only | Tests answer-choice / label-position priors |

Optional fifth mode:

| Mode | Images | Text | Purpose |
| --- | --- | --- | --- |
| Image + options only | correct images | minimal question + choices | Tests visual contribution without full vignette |

How to interpret:

| Pattern | Interpretation |
| --- | --- |
| Correct image+text > text-only | Images likely help |
| Correct image+text > shuffled-image | Correct images matter |
| Shuffled-image ≈ correct image+text | Model may not use image content |
| Text-only ≈ image+text | Benchmark or model is text-dominant |
| Options-only above random | Answer-choice priors exist |
| All modes near random | Model is not solving the benchmark |

The key evidence for a real multimodal gain is:

correct image + text > text only
correct image + text > shuffled image + text

Without that, I would not claim MedXpertQA-MM multimodal generalization.
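As a rough sketch of how the four modes could be built from one list of loaded examples: the function name build_ablation_inputs and the dict layout ("images", "question", "options") are assumptions for illustration, chosen to mirror the strict loader's output. The shuffled mode gives each example the images of a different example.

```python
import random

def build_ablation_inputs(examples, seed=0):
    """Return {mode: [(images, question, options), ...]} for the four modes.

    `examples` is a list of dicts with "images", "question", and "options"
    keys (an assumed layout, not a fixed API).
    """
    n = len(examples)
    if n < 2:
        raise ValueError("Need at least two examples to shuffle images.")

    # Shuffle the indices, then pair each index with its successor in the
    # shuffled order. The result is one cycle, so no example keeps its own images.
    order = list(range(n))
    random.Random(seed).shuffle(order)
    donor = {order[i]: order[(i + 1) % n] for i in range(n)}

    modes = {"text_only": [], "image_text": [], "shuffled_image": [], "options_only": []}
    for i, ex in enumerate(examples):
        question, options = ex["question"], ex["options"]
        modes["text_only"].append(([], question, options))
        modes["image_text"].append((ex["images"], question, options))
        modes["shuffled_image"].append((examples[donor[i]]["images"], question, options))
        modes["options_only"].append(([], "", options))
    return modes
```

Keeping the question and options identical across modes means any accuracy difference can be attributed to the images alone. Note that two different examples could still share visually similar images, so the shuffled mode is a control, not a guarantee.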


4. Compare against the base 2B model first

Before comparing to MedGemma, compare MadriMed-VL-2B to its own base model.

The most important question is:

Did fine-tuning actually improve the 2B model?

Use a table like this:

| Model | MedX text-only | MedX image+text | MedX shuffled | SLAKE | VQA-RAD |
| --- | --- | --- | --- | --- | --- |
| Base 2B VLM | | | | | |
| MadriMed-VL-2B | | | | | |
| Delta | | | | | |

Possible interpretations:

| Pattern | Meaning |
| --- | --- |
| Fine-tuned model improves text-only only | Better medical language / MCQ prior |
| Fine-tuned model improves image+text over shuffled | Better visual grounding |
| Fine-tuned model improves SLAKE/VQA-RAD | Better short-answer medical VQA |
| Fine-tuned model improves MCQ but not SLAKE/VQA-RAD | Better answer-choice reasoning, weaker visual grounding |
| No improvement over base | Fine-tune may not be effective |

A strong result does not have to beat MedGemma immediately. A strong result can be:

Fine-tuning improves a compact 2B VLM substantially over its base model, while remaining runnable locally.


5. Compare MedGemma only in the same harness

Do not make the main claim by comparing a local notebook score to a model-card score.

For a fair MedGemma comparison, run both models through the same evaluation harness:

| Component | Requirement |
| --- | --- |
| Dataset revision | Same for all models |
| Split | Same for all models |
| Image files | Same files |
| Image loader | Same strict loader |
| Prompt | Same task prompt, with only model-specific chat-template adaptation |
| Decoding | Same deterministic settings |
| Answer extractor | Same |
| Metric | Same |
| Unknown handling | Same |
| Logs | Same JSONL schema |

A fair comparison table:

| Model | Mode | Accuracy | 95% CI | Unknown | Same harness |
| --- | --- | --- | --- | --- | --- |
| Base 2B | Image+text | | | | yes |
| MadriMed-VL-2B | Image+text | | | | yes |
| MedGemma 4B | Image+text | | | | yes |

Only after this should you say whether MadriMed beats MedGemma.
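One way to enforce "same harness" is to freeze the shared settings in a single object and write every model's per-example results with one record schema. The sketch below is only an illustration: HarnessConfig, make_record, to_jsonl, and all field names are hypothetical, not part of the notebook.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class HarnessConfig:
    # Shared settings for every model; all names and values are placeholders.
    dataset_revision: str
    split: str
    prompt_template: str
    max_new_tokens: int
    do_sample: bool

def make_record(config, model_name, example_id, mode, raw_output, prediction, label):
    """Build one per-example log record with an identical schema for every model."""
    return {
        "model": model_name,
        "example_id": example_id,
        "mode": mode,
        "raw_output": raw_output,
        "prediction": prediction,
        "label": label,
        "correct": prediction == label,
        "config": asdict(config),
    }

def to_jsonl(records):
    """Serialize records as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)
```

Because the config is embedded in every record, a later reader can verify that two models were scored under identical settings before trusting a delta between them.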


Prompting suggestions

MedXpertQA-MM prompt

Use a strict MCQ prompt and short deterministic generation:

You are answering a medical multiple-choice question.
Use the clinical information and all provided images.

Question:
<question>

Answer choices:
A. <option_a>
B. <option_b>
C. <option_c>
D. <option_d>
E. <option_e>

Return only one letter: A, B, C, D, or E.
Final answer:

Recommended generation:

generate_kwargs = {
    "max_new_tokens": 8,
    "do_sample": False,
}

If the task only needs a letter, max_new_tokens=512 is unnecessarily long and can increase answer-extraction noise.
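A small helper can render the MCQ template above from a row. This sketch assumes that options is a dict mapping the letters A to E to option text; if the dataset stores options differently, the lookup needs adapting. The name format_mcq_prompt is illustrative.

```python
LETTERS = "ABCDE"

def format_mcq_prompt(question: str, options: dict) -> str:
    """Render the strict MCQ prompt; `options` maps letters A-E to option text."""
    missing = [letter for letter in LETTERS if letter not in options]
    if missing:
        # Fail loudly rather than emit a prompt with fewer than five choices.
        raise KeyError(f"Missing option letters: {missing}")
    choices = "\n".join(f"{letter}. {options[letter]}" for letter in LETTERS)
    return (
        "You are answering a medical multiple-choice question.\n"
        "Use the clinical information and all provided images.\n\n"
        f"Question:\n{question}\n\n"
        f"Answer choices:\n{choices}\n\n"
        "Return only one letter: A, B, C, D, or E.\n"
        "Final answer:"
    )
```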

SLAKE / VQA-RAD prompt

Use a short-answer prompt:

Answer the medical image question using a short answer.

For yes/no questions, answer only yes or no.
For modality questions, answer only the modality name.
For anatomy questions, answer only the anatomical structure.

Question:
<question>

Answer:

Recommended generation:

generate_kwargs = {
    "max_new_tokens": 16,
    "do_sample": False,
}

Why different prompts?

Because MedXpertQA-MM is a multiple-choice reasoning task, while SLAKE and VQA-RAD are short-answer VQA tasks. A prompt that helps one can hurt the other.


Answer extraction suggestions

For MedXpertQA-MM:

import re

def extract_mcq_letter(text: str) -> str:
    text = text.strip().upper()

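    patterns = [
        # Explicit markers first. The \b after the letter rejects words such as
        # "ANSWER: DIABETES", which would otherwise match on the leading "D".
        r"FINAL\s+ANSWER\s*[:\-]?\s*\(?([ABCDE])\b\)?",
        r"ANSWER\s*[:\-]?\s*\(?([ABCDE])\b\)?",
        # Output that is only a letter, optionally written as "(B)", "B." or "B)".
        r"^\(?([ABCDE])\)?[\.\)]?$",
        # Last resort: any standalone letter. This can also catch the article
        # "A", so keep generations short and audit the raw outputs.
        r"\b([ABCDE])\b",
    ]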

    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)

    return "UNKNOWN"

Report:

| Diagnostic | Why |
| --- | --- |
| Accuracy counting unknown as wrong | Conservative score |
| Unknown rate | Output-format compliance |
| Answer distribution | Detects A/B/C/D/E bias |
| Raw outputs | Lets others audit extraction |
| Per-choice accuracy | Detects answer-position artifacts |
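The numeric diagnostics in this list can be computed from the extracted letters and gold labels in one pass. This is a minimal sketch; summarize_mcq is a hypothetical helper, and it assumes unparseable outputs were already mapped to "UNKNOWN" by the extractor.

```python
from collections import Counter

def summarize_mcq(predictions, labels):
    """Accuracy (unknown counted as wrong), unknown rate, and letter statistics."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("no examples to summarize")
    n = len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    # Per-choice accuracy is grouped by the gold letter.
    gold_totals = Counter(labels)
    gold_hits = Counter(y for p, y in zip(predictions, labels) if p == y)
    return {
        "accuracy": correct / n,
        "unknown_rate": sum(p == "UNKNOWN" for p in predictions) / n,
        "answer_distribution": dict(Counter(predictions)),
        "per_choice_accuracy": {c: gold_hits[c] / gold_totals[c] for c in gold_totals},
    }
```

An "UNKNOWN" prediction never equals a gold letter, so it is automatically counted as wrong in the accuracy.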

For SLAKE / VQA-RAD, report all of the following:

  • normalized exact match;
  • tokenized F1;
  • open-ended score;
  • closed-ended score;
  • yes/no accuracy.
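For the first two items, a common recipe is SQuAD-style normalization followed by exact match and token-overlap F1. The sketch below follows that recipe as an assumption; it is not the official SLAKE or VQA-RAD scorer, and the normalization rules (lowercasing, dropping punctuation and English articles) are choices that should be reported alongside the score.

```python
import re
import string
from collections import Counter

def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize_answer(prediction).split()
    ref = normalize_answer(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, "the left lung" against "left lung" scores 1.0 on both metrics, while "right lung" against "left lung" scores 0.0 exact match and 0.5 F1.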

Add answer-choice rotation for MedXpertQA-MM

Multiple-choice models can exploit option-position bias. Rotate options.

| Variant | Option order |
| --- | --- |
| Original | A B C D E |
| Rotation 1 | B C D E A |
| Rotation 2 | C D E A B |
| Rotation 3 | D E A B C |
| Rotation 4 | E A B C D |

Then map the predicted letter back to the semantic answer.

Report:

| Metric | Meaning |
| --- | --- |
| Original-order accuracy | Standard score |
| Rotation-mean accuracy | More robust score |
| Semantic consistency | Whether the same answer is chosen under rotations |
| Letter bias | Whether the model over-picks A/B/C/D/E |

If accuracy collapses under rotation, the model may be exploiting option position rather than doing robust reasoning.
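A minimal sketch of the rotation and the map-back step, assuming options is a dict keyed by the letters A to E. The helper names rotate_options and map_back are illustrative. A shift of 1 reproduces "Rotation 1" in the table above, where the original option B is displayed first.

```python
LETTERS = "ABCDE"

def rotate_options(options: dict, shift: int):
    """Rotate option texts by `shift` positions.

    Returns (rotated, back): `rotated` maps displayed letters to option text,
    and `back` maps each displayed letter to its original letter.
    """
    n = len(LETTERS)
    rotated, back = {}, {}
    for i, shown in enumerate(LETTERS):
        original = LETTERS[(i + shift) % n]
        rotated[shown] = options[original]
        back[shown] = original
    return rotated, back

def map_back(predicted_letter: str, back: dict) -> str:
    """Translate a letter predicted on the rotated prompt to the original letter."""
    return back.get(predicted_letter, "UNKNOWN")
```

Scoring then compares map_back(prediction, back) with the original gold label, so the gold labels never need to be rewritten.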


Add confidence intervals and paired tests

For accuracy:

from statsmodels.stats.proportion import proportion_confint

def wilson_ci(correct, total, alpha=0.05):
    return proportion_confint(correct, total, alpha=alpha, method="wilson")

For model comparisons:

import numpy as np

def paired_bootstrap_diff(a_correct, b_correct, n_boot=10000, seed=0):
    rng = np.random.default_rng(seed)

    a_correct = np.asarray(a_correct, dtype=np.float32)
    b_correct = np.asarray(b_correct, dtype=np.float32)

    assert len(a_correct) == len(b_correct)

    n = len(a_correct)
    diffs = []

    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        diffs.append(a_correct[idx].mean() - b_correct[idx].mean())

    return np.percentile(diffs, [2.5, 50, 97.5])

Report deltas:

| Comparison | Delta | 95% CI | Meaning |
| --- | --- | --- | --- |
| MadriMed image+text − MadriMed text-only | | | Visual contribution |
| MadriMed image+text − MadriMed shuffled-image | | | Correct-image contribution |
| MadriMed image+text − Base 2B image+text | | | Fine-tuning gain |
| MadriMed image+text − MedGemma image+text | | | External comparison |

If the confidence interval crosses zero, do not call it a clear win.
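The bootstrap gives an interval; if a p-value for the paired comparison is also wanted, an exact McNemar test is one standard option for paired correct/incorrect outcomes. It is offered here as a complement, not as something the review prescribes. The sketch uses only the standard library, and mcnemar_exact is an illustrative name.

```python
from math import comb

def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar test on paired 0/1 correctness lists.

    Returns (a_only, b_only, p_value). Only discordant pairs, where exactly
    one model is correct, carry information about the difference.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("Inputs must be paired and of equal length")
    a_only = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    b_only = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0
    k = min(a_only, b_only)
    # Under the null hypothesis, each discordant pair favors either model
    # with probability 0.5, so the smaller count is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)
```

Both inputs must be aligned by example id, which is another reason to log per-example results in a shared schema.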


Add per-slice analysis

MedXpertQA includes metadata such as medical task, body system, and question type. Use it.

MedXpertQA-MM slice table

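| Slice | n | Text-only | Image+text | Shuffled-image | Image gain |
| --- | --- | --- | --- | --- | --- |
| Diagnosis | | | | | |
| Treatment | | | | | |
| Basic medicine | | | | | |
| Reasoning | | | | | |
| Understanding | | | | | |
| Cardiovascular | | | | | |
| Dermatology-related | | | | | |
| Radiology-heavy | | | | | |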

SLAKE / VQA-RAD slice table

| Slice | Score | Why it matters |
| --- | --- | --- |
| Open-ended | | Harder answer normalization |
| Closed-ended | | Often yes/no-heavy |
| Yes/no | | Detects yes/no bias |
| Modality | | Basic visual recognition |
| Organ/body part | | Anatomical grounding |
| Abnormality | | Clinical visual interpretation |
| Location | | Spatial reasoning |

This will show whether the model is weak because it cannot see, cannot reason, cannot answer concisely, or cannot handle a particular modality.
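Filling the slice tables is mostly a group-by over per-example records. The sketch below assumes each record is a dict carrying a boolean "correct" plus metadata keys such as "medical_task"; the record layout and function names are illustrative.

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    """Group per-example records by `slice_key`; return {slice: (n, accuracy)}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        # Rows without the metadata field go into an explicit bucket.
        name = rec.get(slice_key) or "unspecified"
        totals[name] += 1
        hits[name] += int(rec["correct"])
    return {name: (totals[name], hits[name] / totals[name]) for name in totals}

def image_gain_by_slice(text_only_records, image_text_records, slice_key):
    """Per-slice accuracy difference: image+text minus text-only."""
    base = accuracy_by_slice(text_only_records, slice_key)
    full = accuracy_by_slice(image_text_records, slice_key)
    return {name: full[name][1] - base[name][1] for name in full if name in base}
```

Reporting n next to each accuracy matters here, because small slices can swing by many points on a handful of examples.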


Check for leakage

Medical VQA datasets are small and frequently reused. Leakage checks are important.

Check:

  • exact image duplicates between train and test;
  • perceptual image duplicates;
  • repeated clinical vignettes;
  • repeated question/answer pairs;
  • synthetic data generated from benchmark examples;
  • captions or filenames that reveal labels.

Example image hash check:

from PIL import Image
import imagehash

def phash(path):
    return imagehash.phash(Image.open(path).convert("RGB"))

Then compare training images against MedXpertQA-MM, SLAKE, and VQA-RAD evaluation images.
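Perceptual hashes cover near-duplicate images. The other items in the checklist (byte-identical files, repeated questions or vignettes) can be screened with plain hashing from the standard library. This is a minimal sketch with illustrative helper names; the text normalization is deliberately crude and will miss paraphrases.

```python
import hashlib
import re

def text_key(text: str) -> str:
    """Hash of lowercased text with punctuation and extra whitespace removed."""
    cleaned = " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()

def bytes_key(data: bytes) -> str:
    """Hash of raw file bytes; catches byte-identical image copies only."""
    return hashlib.sha256(data).hexdigest()

def find_overlap(train_items, eval_items, key_fn):
    """Return the eval items whose key also appears among the training items."""
    train_keys = {key_fn(item) for item in train_items}
    return [item for item in eval_items if key_fn(item) in train_keys]
```

For images, pass the file contents (for example from Path.read_bytes()) with bytes_key; for questions, pass the strings with text_key. Any non-empty result is worth inspecting by hand before reporting scores.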



Training suggestions for the next version

Once evaluation is fixed, I would train in stages rather than mixing everything together.

Stage 1 — visual-medical grounding

Goal: teach the model to see medical images.

Examples:

Question: What modality is this?
Answer: CT

Question: Which body region is shown?
Answer: chest

Question: Does the image contain liver?
Answer: no

Question: Is there pleural effusion?
Answer: yes

Train on:

  • modality recognition;
  • anatomy recognition;
  • body-region recognition;
  • view / plane recognition;
  • presence / absence;
  • normal / abnormal.

Stage 2 — short-answer VQA

Goal: improve SLAKE / VQA-RAD.

Examples:

Question: Is there cardiomegaly?
Answer: yes

Question: What organ is shown?
Answer: lung

Question: What imaging modality is used?
Answer: x-ray

Make the answer short and canonical.

Stage 3 — clinical MCQ reasoning

Goal: improve MedXpertQA-style reasoning.

Example:

Question:
<clinical_vignette_plus_image_context>

Answer choices:
A. <option_a>
B. <option_b>
C. <option_c>
D. <option_d>
E. <option_e>

Final answer: C

Stage 4 — mixed replay

Goal: avoid overfitting to one format.

Mix:

  • short-answer VQA;
  • yes/no;
  • modality/anatomy;
  • MCQ;
  • general VLM instruction samples;
  • negative examples.
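One simple way to build such a mix is weighted sampling over named pools. The sketch below is an assumption about how the replay could be implemented; the pool names and weights are placeholders, not recommended ratios.

```python
import random

def sample_mixture(pools, weights, n_samples, seed=0):
    """Draw (pool_name, example) pairs from named pools in proportion to `weights`.

    `pools` maps a pool name to a list of examples; `weights` maps the same
    names to non-negative sampling weights. Empty pools are skipped.
    """
    names = [name for name in pools if pools[name]]
    if not names:
        raise ValueError("All pools are empty")
    rng = random.Random(seed)
    chosen = rng.choices(names, weights=[weights[name] for name in names], k=n_samples)
    return [(name, rng.choice(pools[name])) for name in chosen]
```

A call such as sample_mixture({"short_vqa": [...], "mcq": [...]}, {"short_vqa": 3, "mcq": 1}, 1000) would draw roughly three short-answer examples per MCQ example; the right ratio is an empirical question to settle with the ablations above.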

Fine-tuning target suggestions

Do not only test one fine-tuning configuration.

For a VLM, language-only tuning can improve medical wording without improving image use. Test at least three variants:

| Run | Tuned components | What it tests |
| --- | --- | --- |
| A | Language layers only | Better medical language / answer style |
| B | Projector + language layers | Better image-text alignment |
| C | Late vision blocks + projector + language | Better medical visual adaptation |

Then evaluate:

| Signal | What you want |
| --- | --- |
| Correct-image vs text-only | Images help |
| Correct-image vs shuffled-image | Correct images matter |
| SLAKE/VQA-RAD improvement | Better visual grounding |
| MCQ improvement | Better clinical answer-choice reasoning |

The best run is not necessarily the one with the highest raw score. It is the one that improves the intended


Final verdict

Is this good methodology?

Not yet.

It is a useful exploratory experiment, but not a strong benchmark methodology for claiming a win over MedGemma.

Did MedXpertQA generalization hold up?

Probably not yet. The current MedXpertQA-MM result appears to be a zero-image / text-only run, and the score is near the 20% random baseline for a 5-way MCQ task.

Why are SLAKE / VQA-RAD lower?

That part is plausible. SLAKE and VQA-RAD stress short-answer visual grounding, modality/anatomy recognition, yes/no calibration, and answer normalization. A small model can lag there even if it becomes better at medical wording or MCQ-style output.

What should happen next?

  1. Fix image → images.
  2. Make missing images a hard error.
  3. Treat 21.05% as a text-only baseline.
  4. Rerun MedXpertQA-MM as image+text.
  5. Add text-only, shuffled-image, and options-only ablations.
  6. Run the base 2B model.
  7. Run MedGemma in the same harness.
  8. Report confidence intervals and paired deltas.
  9. Break results down by task, body system, question type, open/closed type, and yes/no type.
  10. Update the model card with conservative wording.

Short summary

  • The project direction is good.
  • The current MedXpertQA-MM headline is not yet supported.
  • The evaluation appears to read an image field, but the dataset stores image filenames under images.
  • The printed result shows every MedXpertQA-MM example under 0 image(s).
  • Therefore, 21.05% is best treated as a text-only / zero-image baseline, not a multimodal score.
  • SLAKE / VQA-RAD gaps are plausible because those benchmarks test short-answer visual grounding.
  • The next step is a strict rerun with image validation, ablations, same-harness MedGemma comparison, and confidence intervals.
