Fine-tuned our first 2B medical VLM on a single MacBook M4; it beats Google's MedGemma 4B on the MedXpertQA-MM eval dataset

opsmadrisight · May 16, 2026, 4:21pm

Hey everyone,

I’ve been experimenting with fine-tuning smaller Vision-Language Models for specialized domains. I wanted to see how far I could push a 2B-parameter model on medical reasoning tasks using targeted data.

Model: madrisight/MadriMed-VL-2B · Hugging Face

The benchmark performance vs. google/medgemma-4b-it

MedXpertQA-MM: 21.05 (vs. 18.8). Outperforming a model twice its size on complex expertise tasks!

Slake: 65.7 (vs. 72.3)

VQA RAD: 43.09 (vs. 49.9)

I’m also sharing the evaluation notebook: https://github.com/krrish-v/all_huggingface/blob/main/model_evaluation/MadriMed-VL-2B_evaluation.ipynb

Happy to discuss methodology. I’m especially curious whether anyone has thoughts on why the MedXpertQA generalization held up despite the SLAKE/VQA-RAD gap.

John6666 · May 19, 2026, 5:11am

It appears that the performance of the multimodal component may not have been evaluated:

Methodology review: MadriMed-VL-2B vs MedGemma 4B on MedXpertQA-MM, SLAKE, and VQA-RAD

This is a promising project, but I would be careful with the current headline.

The compact-medical-VLM direction is genuinely interesting: fine-tuning a 2B vision-language model locally, making it usable on consumer hardware, and testing it on medical VQA / medical reasoning tasks is a worthwhile experiment. The useful contribution could be: a reproducible small-model medical VLM workflow that runs locally and is evaluated honestly.

But the current benchmark claim:

MadriMed-VL-2B beats Google’s MedGemma 4B on MedXpertQA-MM.

is not methodologically safe yet.

The short version is:

The reported 21.05% MedXpertQA-MM score appears to be a text-only / zero-image multiple-choice run, not a confirmed multimodal MedXpertQA-MM score.

That does not mean the model or project is bad. It means the benchmark result needs to be reclassified and rerun with stricter multimodal validation.

Why I would not treat the current MedXpertQA-MM score as a confirmed multimodal result

The key issue is an image-field mismatch.

The MedXpertQA-MM evaluation loop appears to read:

image_list = row.get("image", [])

But MedXpertQA-MM uses an images field for the multimodal image filenames, not a singular image field.

This matters because row.get("image", []) does not fail loudly. It silently returns an empty list when the field is missing. Then the generation function can continue as a text-only run.

That is exactly what the printed result suggests:

MedXpertQA-MM Results (n=2000)
Accuracy      : 21.05% (421/2000)
Unknown       : 0 (0.0%)
Random Baseline: 20.00%

Per-Image-Count Breakdown:
0 image(s): 21.1% (421/2000)

So all 2,000 MedXpertQA-MM examples appear to have been evaluated with 0 loaded images.

That changes the interpretation completely.

Instead of:

MedXpertQA-MM multimodal score: 21.05%.

I would report it as:

MedXpertQA-MM text-only / zero-image baseline: 21.05%.
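The silent fallback described above is easy to reproduce with a toy row. This is only a minimal sketch: the row id, question text, and filename are made up for illustration, and only the key names (`image` vs `images`) come from the discussion above.

```python
# Hypothetical row: only the key names matter, the values are made up.
row = {
    "id": "example-0",
    "question": "placeholder question",
    "images": ["example-0-a.jpeg"],
}

# The lookup pattern from the notebook: the key is absent, so the default wins.
silent = row.get("image", [])
print(len(silent))  # 0

# Strict indexing on the correct key finds the filename.
strict = row["images"]
print(len(strict))  # 1

# Strict indexing on the wrong key fails loudly instead of returning [].
try:
    row["image"]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```

The point is that the `.get(..., [])` form turns a schema mismatch into a valid-looking empty list, while plain indexing surfaces it on the first row.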

Why this matters statistically

MedXpertQA-MM has 5 answer choices per question. A random baseline is therefore about:

20%

The reported score is:

421 / 2000 = 21.05%

That is only 1.05 percentage points above random.

A rough 95% binomial confidence interval for 421 correct out of 2,000 is approximately:

19.3% to 22.8%

That interval includes 20%. So even before the image-loading issue, 21.05% is not strong evidence of meaningful MedXpertQA-MM generalization.

With the image-loading issue, the safer interpretation is:

The current result is a near-random text-only MCQ baseline, not evidence that the model beat MedGemma on multimodal MedXpertQA-MM.

This is also why I would avoid saying “beats MedGemma” until the corrected multimodal evaluation is rerun.
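The arithmetic above can be checked with the standard library alone. This sketch uses the normal-approximation (Wald) interval, which is my assumption about what "rough" means here, plus a z-statistic against the 20% chance rate.

```python
import math

correct, total = 421, 2000
chance = 0.20  # 5 answer choices

p_hat = correct / total

# Normal-approximation (Wald) 95% interval around the observed accuracy.
z95 = 1.96
se_hat = math.sqrt(p_hat * (1 - p_hat) / total)
ci_low = p_hat - z95 * se_hat
ci_high = p_hat + z95 * se_hat

# z-statistic against chance, using the standard error under the null.
se_null = math.sqrt(chance * (1 - chance) / total)
z_stat = (p_hat - chance) / se_null

print(f"accuracy = {p_hat:.4f}")
print(f"95% CI   = [{ci_low:.3f}, {ci_high:.3f}]")
print(f"z vs 20% = {z_stat:.2f}")
```

A z-statistic below 1.96 means the score cannot be distinguished from guessing at the usual two-sided 5% level.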

Why the SLAKE / VQA-RAD gap is still meaningful

The reported SLAKE and VQA-RAD numbers are below MedGemma:

Benchmark	MadriMed-VL-2B	MedGemma 4B	Gap
SLAKE	65.7	72.3	-6.6
VQA-RAD	43.09	49.9	-6.81
MedXpertQA-MM	21.05	18.8	+2.25

The SLAKE / VQA-RAD gap is plausible and informative.

Those datasets stress short-answer medical visual grounding:

yes/no calibration;
modality recognition;
anatomy / organ recognition;
abnormality recognition;
concise answer formatting;
answer normalization;
tokenized F1 or exact-match scoring.

A model can become more medically fluent or better at multiple-choice answer selection while still being weaker on short-answer visual grounding.

But the MedXpertQA result probably did not “hold up despite the SLAKE/VQA-RAD gap” yet. The more likely explanation is simpler:

SLAKE and VQA-RAD measured image-question answering, while the current MedXpertQA-MM run likely did not pass images at all.

So I would not explain the MedXpertQA number as successful multimodal generalization yet. I would first fix and rerun the evaluation.

Is the current methodology good?

For an exploratory notebook: yes, it is useful.

For a public benchmark claim against MedGemma: not yet.

The main methodological problems are:

Issue	Why it matters
`image` vs `images` mismatch	Converts MedXpertQA-MM into a zero-image run
Silent fallback to `[]`	Hides dataset-schema bugs
Printed result shows `0 image(s)` for all examples	Confirms the multimodal path probably failed
21.05% is near the 20% random baseline	Weak statistical evidence
Local score compared to MedGemma model-card score	Not a same-harness comparison
Different benchmark metrics	SLAKE/VQA-RAD are often tokenized-F1 style; MedXpertQA is accuracy
No text-only / shuffled-image ablations	Cannot prove image use
No confidence intervals / paired test	Cannot judge whether a small delta is meaningful
No per-slice analysis	Cannot identify what actually improved
No leakage checks described	Medical VQA datasets are small and reused

The strongest version of the work would fix these issues and then report results more conservatively.

What I would do next

1. Reclassify the current 21.05% result

I would change the result table from this:

Benchmark	MadriMed-VL-2B	MedGemma 4B
MedXpertQA-MM	21.05	18.8

to this:

Benchmark	Mode	MadriMed-VL-2B	Interpretation
MedXpertQA-MM	Text-only / zero-image run	21.05	Near-random 5-way MCQ baseline
MedXpertQA-MM	Correct image + text	pending	True multimodal rerun
MedXpertQA-MM	Shuffled image + text	pending	Image-use control
MedXpertQA-MM	Options only	pending	Choice-prior control
SLAKE	Image + text	65.7	Short-answer medical VQA
VQA-RAD	Image + text	43.09	Radiology VQA

That framing is much safer.

2. Make image loading strict

For MedXpertQA-MM, missing images should be a hard error.

Do not use:

image_list = row.get("image", [])

Use something stricter:

from pathlib import Path
from PIL import Image

def load_medxpertqa_mm_example(row, image_root: Path):
    required = ["id", "question", "options", "label", "images"]
    missing = [k for k in required if k not in row]
    if missing:
        raise KeyError(f"Missing required fields: {missing}. Available keys: {list(row.keys())}")

    image_list = row["images"]

    if isinstance(image_list, str):
        image_list = [image_list]

    if not isinstance(image_list, list):
        raise TypeError(f"Expected row['images'] to be a list, got {type(image_list)}")

    if len(image_list) == 0:
        raise ValueError(f"MedXpertQA-MM row has no image filenames: id={row['id']}")

    images = []
    for filename in image_list:
        path = image_root / filename
        if not path.exists():
            raise FileNotFoundError(f"Missing image file: {path}")

        images.append(Image.open(path).convert("RGB"))

    if len(images) != len(image_list):
        raise RuntimeError(
            f"Image count mismatch for id={row['id']}: "
            f"filenames={len(image_list)}, loaded={len(images)}"
        )

    return {
        "id": row["id"],
        "question": row["question"],
        "options": row["options"],
        "label": row["label"].strip().upper(),
        "image_filenames": image_list,
        "images": images,
        "medical_task": row.get("medical_task"),
        "body_system": row.get("body_system"),
        "question_type": row.get("question_type"),
    }

Before running inference, print image telemetry:

from collections import Counter

filename_counts = Counter()
loaded_counts = Counter()
bad_rows = []

for row in dataset:
    try:
        ex = load_medxpertqa_mm_example(row, IMAGE_DIR)
        filename_counts[len(ex["image_filenames"])] += 1
        loaded_counts[len(ex["images"])] += 1
    except Exception as e:
        bad_rows.append({"id": row.get("id"), "error": repr(e)})

print("Rows:", len(dataset))
print("Image filename counts:", filename_counts)
print("Loaded image counts:", loaded_counts)
print("Bad rows:", len(bad_rows))

assert len(bad_rows) == 0
assert loaded_counts[0] == 0
assert sum(k * v for k, v in loaded_counts.items()) > 0

Every multimodal result should include something like:

Run	Rows	Rows with image filenames	Rows with loaded images	Zero-image rows
MedXpertQA-MM image+text	2000	2000	2000	0

If Zero-image rows = 2000, it is not a multimodal evaluation.

3. Add required MedXpertQA-MM ablations

For MedXpertQA-MM, one score is not enough.

I would run at least four modes:

Mode	Images	Text	Purpose
Text only	none	question + choices	Measures vignette / option-prior reasoning
Correct image + text	correct images	question + choices	Intended multimodal evaluation
Shuffled image + text	wrong images	question + choices	Tests whether correct images matter
Options only	none	choices only	Tests answer-choice / label-position priors

Optional fifth mode:

Mode	Images	Text	Purpose
Image + options only	correct images	minimal question + choices	Tests visual contribution without full vignette

How to interpret:

Pattern	Interpretation
Correct image+text > text-only	Images likely help
Correct image+text > shuffled-image	Correct images matter
Shuffled-image ≈ correct image+text	Model may not use image content
Text-only ≈ image+text	Benchmark or model is text-dominant
Options-only above random	Answer-choice priors exist
All modes near random	Model is not solving the benchmark

The key evidence for a real multimodal gain is:

correct image + text > text only
correct image + text > shuffled image + text

Without that, I would not claim MedXpertQA-MM multimodal generalization.
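For the shuffled-image control, it helps to guarantee that no example accidentally keeps its own images. Below is a minimal sketch of one way to do that; `shuffled_image_assignment` is a hypothetical helper name, and it assumes example ids are unique.

```python
import random

def shuffled_image_assignment(example_ids, seed=0):
    """Map each example id to a different example id whose images it borrows."""
    ids = list(example_ids)
    if len(set(ids)) != len(ids):
        raise ValueError("example ids must be unique")
    if len(ids) < 2:
        raise ValueError("need at least two examples to shuffle images")

    rng = random.Random(seed)
    order = ids[:]
    rng.shuffle(order)

    # Each id borrows images from its successor in the shuffled cycle,
    # so no id can ever be paired with itself.
    n = len(order)
    return {order[i]: order[(i + 1) % n] for i in range(n)}

donors = shuffled_image_assignment(["q1", "q2", "q3", "q4"], seed=0)
print(donors)
```

Note that the donor example may have a different number of images than the original, so it is worth logging image counts per row in this mode as well. A single shuffled cycle is not a uniformly random derangement, but it is enough for a control run.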

4. Compare against the base 2B model first

Before comparing to MedGemma, compare MadriMed-VL-2B to its own base model.

The most important question is:

Did fine-tuning actually improve the 2B model?

Use a table like this:

Model	MedX text-only	MedX image+text	MedX shuffled	SLAKE	VQA-RAD
Base 2B VLM
MadriMed-VL-2B
Delta

Possible interpretations:

Pattern	Meaning
Fine-tuned model improves text-only only	Better medical language / MCQ prior
Fine-tuned model improves image+text over shuffled	Better visual grounding
Fine-tuned model improves SLAKE/VQA-RAD	Better short-answer medical VQA
Fine-tuned model improves MCQ but not SLAKE/VQA-RAD	Better answer-choice reasoning, weaker visual grounding
No improvement over base	Fine-tune may not be effective

A strong result does not have to beat MedGemma immediately. A strong result can be:

Fine-tuning improves a compact 2B VLM substantially over its base model, while remaining runnable locally.

5. Compare MedGemma only in the same harness

Do not make the main claim by comparing a local notebook score to a model-card score.

For a fair MedGemma comparison, run both models through the same evaluation harness:

Component	Requirement
Dataset revision	Same for all models
Split	Same for all models
Image files	Same files
Image loader	Same strict loader
Prompt	Same task prompt, with only model-specific chat-template adaptation
Decoding	Same deterministic settings
Answer extractor	Same
Metric	Same
Unknown handling	Same
Logs	Same JSONL schema

A fair comparison table:

Model	Mode	Same harness
Base 2B	Image+text	yes
MadriMed-VL-2B	Image+text	yes
MedGemma 4B	Image+text	yes

Only after this should you say whether MadriMed beats MedGemma.
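As an illustration of the "same JSONL schema" row in the table above, here is one possible per-example log record. Every field name is my own suggestion rather than an existing schema, and the example values are placeholders.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class EvalRecord:
    model: str
    benchmark: str
    mode: str          # e.g. "image+text", "text-only", "shuffled-image"
    example_id: str
    num_images: int    # images actually passed to the model
    prompt: str
    raw_output: str
    prediction: str    # extracted answer, or "UNKNOWN"
    label: str
    correct: bool

def to_jsonl_line(record: EvalRecord) -> str:
    # sort_keys keeps the field order identical across models and runs.
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)

# Placeholder values, for illustration only.
line = to_jsonl_line(EvalRecord(
    model="MadriMed-VL-2B",
    benchmark="MedXpertQA-MM",
    mode="image+text",
    example_id="example-0",
    num_images=1,
    prompt="placeholder prompt",
    raw_output="Final answer: C",
    prediction="C",
    label="C",
    correct=True,
))
print(line)
```

Logging `num_images` per row makes a zero-image run visible in the logs themselves, not only in the summary printout.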

Prompting suggestions

MedXpertQA-MM prompt

Use a strict MCQ prompt and short deterministic generation:

You are answering a medical multiple-choice question.
Use the clinical information and all provided images.

Question:
<question>

Answer choices:
A. <option_a>
B. <option_b>
C. <option_c>
D. <option_d>
E. <option_e>

Return only one letter: A, B, C, D, or E.
Final answer:

Recommended generation:

generate_kwargs = {
    "max_new_tokens": 8,
    "do_sample": False,
}

If the task only needs a letter, max_new_tokens=512 is unnecessarily long and can increase answer-extraction noise.
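A small helper can fill the MCQ template above so that every model sees identical text. This is a sketch under one assumption I have not verified against the dataset: that `options` is a dict keyed by the letters A to E. `build_mcq_prompt` is a hypothetical name.

```python
LETTERS = "ABCDE"

def build_mcq_prompt(question: str, options: dict) -> str:
    """Fill the strict MCQ template; assumes options maps 'A'..'E' to text."""
    missing = [letter for letter in LETTERS if letter not in options]
    if missing:
        raise KeyError(f"Missing answer choices: {missing}")

    choice_lines = "\n".join(f"{letter}. {options[letter]}" for letter in LETTERS)
    return (
        "You are answering a medical multiple-choice question.\n"
        "Use the clinical information and all provided images.\n\n"
        f"Question:\n{question}\n\n"
        f"Answer choices:\n{choice_lines}\n\n"
        "Return only one letter: A, B, C, D, or E.\n"
        "Final answer:"
    )

# Placeholder question and choices, for illustration only.
prompt = build_mcq_prompt(
    "placeholder question",
    {"A": "choice a", "B": "choice b", "C": "choice c", "D": "choice d", "E": "choice e"},
)
print(prompt)
```

The model-specific chat template and image placeholders would be applied around this string, not inside it.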

SLAKE / VQA-RAD prompt

Use a short-answer prompt:

Answer the medical image question using a short answer.

For yes/no questions, answer only yes or no.
For modality questions, answer only the modality name.
For anatomy questions, answer only the anatomical structure.

Question:
<question>

Answer:

Recommended generation:

generate_kwargs = {
    "max_new_tokens": 16,
    "do_sample": False,
}

Why different prompts?

Because MedXpertQA-MM is a multiple-choice reasoning task, while SLAKE and VQA-RAD are short-answer VQA tasks. A prompt that helps one can hurt the other.

Answer extraction suggestions

For MedXpertQA-MM:

import re

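def extract_mcq_letter(text: str) -> str:
    # Uppercase first so the patterns only need to handle A-E in one case.
    text = text.strip().upper()

    # Ordered from most to least specific. The (?![A-Z]) lookahead stops
    # "ANSWER: CANCER" from being read as the letter C.
    patterns = [
        r"FINAL\s+ANSWER\s*[:\-]?\s*\(?([ABCDE])\)?(?![A-Z])",
        r"ANSWER\s*[:\-]?\s*\(?([ABCDE])\)?(?![A-Z])",
        r"^\(?([ABCDE])\)?[\.\)]?$",
        r"\b([ABCDE])\b",  # last resort; can also match the article "A"
    ]

    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)

    return "UNKNOWN"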

Report:

Diagnostic	Why
Accuracy counting unknown as wrong	Conservative score
Unknown rate	Output-format compliance
Answer distribution	Detects A/B/C/D/E bias
Raw outputs	Lets others audit extraction
Per-choice accuracy	Detects answer-position artifacts
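Most of the diagnostics in this table can be computed from two parallel lists of extracted predictions and gold labels. A minimal sketch, where `mcq_diagnostics` is a hypothetical helper name and the sample inputs are made up:

```python
from collections import Counter

def mcq_diagnostics(predictions, labels):
    """Summarize MCQ results; 'UNKNOWN' predictions count as wrong."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")

    total = len(labels)
    hits = [p == y for p, y in zip(predictions, labels)]
    label_counts = Counter(labels)
    hits_per_label = Counter(y for y, hit in zip(labels, hits) if hit)

    return {
        "accuracy": sum(hits) / total,
        "unknown_rate": sum(p == "UNKNOWN" for p in predictions) / total,
        "prediction_counts": dict(Counter(predictions)),
        "per_choice_accuracy": {
            letter: hits_per_label[letter] / n
            for letter, n in sorted(label_counts.items())
        },
    }

report = mcq_diagnostics(["A", "B", "UNKNOWN", "A"], ["A", "C", "C", "B"])
print(report)
```

Comparing `prediction_counts` with the label distribution is a quick way to spot a model that over-picks one letter.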

For SLAKE / VQA-RAD, report all of the following:

normalized exact match;
tokenized F1;
open-ended score;
closed-ended score;
yes/no accuracy.
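For the first two metrics in that list, a minimal sketch could look like the following. The normalization rules (lowercasing, stripping punctuation and English articles) are my assumption, borrowed from common QA scoring practice; they are not the official scorer of either benchmark, so report them alongside the numbers.

```python
import re
import string
from collections import Counter

def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)

    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Yes.", "yes"))
print(token_f1("left lung", "right lung"))
```

Whatever normalization is chosen, it should be identical for every model in the comparison.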

Add answer-choice rotation for MedXpertQA-MM

Multiple-choice models can exploit option-position bias. Rotate options.

Variant	Option order
Original	A B C D E
Rotation 1	B C D E A
Rotation 2	C D E A B
Rotation 3	D E A B C
Rotation 4	E A B C D

Then map the predicted letter back to the semantic answer.

Report:

Metric	Meaning
Original-order accuracy	Standard score
Rotation-mean accuracy	More robust score
Semantic consistency	Whether the same answer is chosen under rotations
Letter bias	Whether the model over-picks A/B/C/D/E

If accuracy collapses under rotation, the model may be exploiting option position rather than doing robust reasoning.
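The rotations in the table above, and the mapping back to the original letter, can be sketched as follows. As before, this assumes `options` is a dict keyed by A to E; `rotate_options` is a hypothetical helper name and the option texts are placeholders.

```python
LETTERS = "ABCDE"

def rotate_options(options: dict, shift: int):
    """Return (rotated_options, shown_to_original) for a cyclic shift.

    With shift=1 the text originally under B is shown as A, C as B, and so on,
    which matches "Rotation 1: B C D E A".
    """
    n = len(LETTERS)
    rotated = {}
    shown_to_original = {}
    for i, shown in enumerate(LETTERS):
        original = LETTERS[(i + shift) % n]
        rotated[shown] = options[original]
        shown_to_original[shown] = original
    return rotated, shown_to_original

# Placeholder option texts, for illustration only.
options = {"A": "text a", "B": "text b", "C": "text c", "D": "text d", "E": "text e"}
rotated, back = rotate_options(options, shift=1)

# If the model answers "B" on the rotated prompt, it chose the original option C.
print(back["B"])  # C
```

Scoring then compares `back[predicted_letter]` with the original label, so the gold labels never need to be rewritten.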

Add confidence intervals and paired tests

For accuracy:

from statsmodels.stats.proportion import proportion_confint

def wilson_ci(correct, total, alpha=0.05):
    return proportion_confint(correct, total, alpha=alpha, method="wilson")

For model comparisons:

import numpy as np

def paired_bootstrap_diff(a_correct, b_correct, n_boot=10000, seed=0):
    rng = np.random.default_rng(seed)

    a_correct = np.asarray(a_correct, dtype=np.float32)
    b_correct = np.asarray(b_correct, dtype=np.float32)

    assert len(a_correct) == len(b_correct)

    n = len(a_correct)
    diffs = []

    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        diffs.append(a_correct[idx].mean() - b_correct[idx].mean())

    return np.percentile(diffs, [2.5, 50, 97.5])

Report deltas:

Comparison	Delta	95% CI	Meaning
MadriMed image+text − MadriMed text-only			Visual contribution
MadriMed image+text − MadriMed shuffled-image			Correct-image contribution
MadriMed image+text − Base 2B image+text			Fine-tuning gain
MadriMed image+text − MedGemma image+text			External comparison

If the confidence interval crosses zero, do not call it a clear win.
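As a complement to the paired bootstrap, an exact McNemar test is another common paired test for two models scored on the same examples. It is a technique I am adding here, not something from the notebook. The sketch below uses only the standard library; `mcnemar_exact` is a hypothetical helper name.

```python
from math import comb

def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar test on per-example correctness flags.

    Returns (a_only, b_only, p_value), where a_only counts examples that
    only model A got right and b_only counts those only model B got right.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both models must be scored on the same examples")

    a_only = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    b_only = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)

    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0

    # Under the null, discordant pairs split 50/50 between the two models.
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1))
    p_value = min(1.0, 2 * tail / 2**n)
    return a_only, b_only, p_value

# Toy flags: model A is right on 10 examples where model B is wrong.
a_flags = [1] * 10 + [0] * 5
b_flags = [0] * 10 + [0] * 5
print(mcnemar_exact(a_flags, b_flags))
```

Only the discordant examples carry information here, which is why per-example logs from the same harness are needed for both models.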

Add per-slice analysis

MedXpertQA includes metadata such as medical task, body system, and question type. Use it.

MedXpertQA-MM slice table

Slice	n	Text-only	Image+text	Shuffled-image	Image gain
Diagnosis
Treatment
Basic medicine
Reasoning
Understanding
Cardiovascular
Dermatology-related
Radiology-heavy

SLAKE / VQA-RAD slice table

Slice	Score	Why it matters
Open-ended		Harder answer normalization
Closed-ended		Often yes/no-heavy
Yes/no		Detects yes/no bias
Modality		Basic visual recognition
Organ/body part		Anatomical grounding
Abnormality		Clinical visual interpretation
Location		Spatial reasoning

This will show whether the model is weak because it cannot see, cannot reason, cannot answer concisely, or cannot handle a particular modality.
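Filling in the slice tables only needs a group-by over per-example records. A minimal sketch, assuming each record is a dict with a boolean `correct` field and a metadata field such as `medical_task`; the sample records are made up.

```python
from collections import defaultdict

def accuracy_by_slice(records, key):
    """Group per-example records by a metadata field and report n and accuracy."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        # Rows without the metadata go into an explicit "missing" bucket.
        slice_name = record.get(key) or "missing"
        totals[slice_name] += 1
        hits[slice_name] += int(bool(record["correct"]))

    return {
        name: {"n": totals[name], "accuracy": hits[name] / totals[name]}
        for name in sorted(totals)
    }

# Made-up records, for illustration only.
records = [
    {"medical_task": "Diagnosis", "correct": True},
    {"medical_task": "Diagnosis", "correct": False},
    {"medical_task": "Treatment", "correct": True},
    {"medical_task": None, "correct": False},
]
by_task = accuracy_by_slice(records, "medical_task")
print(by_task)
```

Running it once per mode (text-only, image+text, shuffled-image) gives the columns of the MedXpertQA-MM slice table, and reporting `n` next to each accuracy keeps small slices from being over-read.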

Check for leakage

Medical VQA datasets are small and frequently reused. Leakage checks are important.

Check:

exact image duplicates between train and test;
perceptual image duplicates;
repeated clinical vignettes;
repeated question/answer pairs;
synthetic data generated from benchmark examples;
captions or filenames that reveal labels.

Example image hash check:

from PIL import Image
import imagehash

def phash(path):
    return imagehash.phash(Image.open(path).convert("RGB"))

Then compare training images against MedXpertQA-MM, SLAKE, and VQA-RAD evaluation images.
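The text side of the checklist (repeated vignettes and question/answer pairs) can get a similar first-pass check. This sketch only catches texts that are identical after simple normalization; near-duplicates and paraphrases would need fuzzy matching on top. The helper names and sample strings are hypothetical.

```python
import hashlib
import re

def text_fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing non-alphanumerics."""
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_text_overlap(train_texts, eval_texts):
    """Return the eval texts whose fingerprint also appears in the training set."""
    train_prints = {text_fingerprint(t) for t in train_texts}
    return [t for t in eval_texts if text_fingerprint(t) in train_prints]

# Made-up strings, for illustration only.
train = ["Is there cardiomegaly?", "What organ is shown?"]
evaluation = ["is there  cardiomegaly", "Which body region is shown?"]
overlap = find_text_overlap(train, evaluation)
print(overlap)  # ['is there  cardiomegaly']
```

Any non-empty overlap is worth inspecting by hand before publishing a benchmark number.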

Training suggestions for the next version

Once evaluation is fixed, I would train in stages rather than mixing everything together.

Stage 1 — visual-medical grounding

Goal: teach the model to see medical images.

Examples:

Question: What modality is this?
Answer: CT

Question: Which body region is shown?
Answer: chest

Question: Does the image contain liver?
Answer: no

Question: Is there pleural effusion?
Answer: yes

Train on:

modality recognition;
anatomy recognition;
body-region recognition;
view / plane recognition;
presence / absence;
normal / abnormal.

Stage 2 — short-answer VQA

Goal: improve SLAKE / VQA-RAD.

Examples:

Question: Is there cardiomegaly?
Answer: yes

Question: What organ is shown?
Answer: lung

Question: What imaging modality is used?
Answer: x-ray

Make the answer short and canonical.

Stage 3 — clinical MCQ reasoning

Goal: improve MedXpertQA-style reasoning.

Example:

Question:
<clinical_vignette_plus_image_context>

Answer choices:
A. <option_a>
B. <option_b>
C. <option_c>
D. <option_d>
E. <option_e>

Final answer: C

Stage 4 — mixed replay

Goal: avoid overfitting to one format.

Mix:

short-answer VQA;
yes/no;
modality/anatomy;
MCQ;
general VLM instruction samples;
negative examples.

Fine-tuning target suggestions

Do not only test one fine-tuning configuration.

For a VLM, language-only tuning can improve medical wording without improving image use. Test at least three variants:

Run	Tuned components	What it tests
A	Language layers only	Better medical language / answer style
B	Projector + language layers	Better image-text alignment
C	Late vision blocks + projector + language	Better medical visual adaptation

Then evaluate:

Signal	What you want
Correct-image vs text-only	Images help
Correct-image vs shuffled-image	Correct images matter
SLAKE/VQA-RAD improvement	Better visual grounding
MCQ improvement	Better clinical answer-choice reasoning

The best run is not necessarily the one with the highest raw score. It is the one that improves the intended

Final verdict

Is this good methodology?

Not yet.

It is a useful exploratory experiment, but not a strong benchmark methodology for claiming a win over MedGemma.

Did MedXpertQA generalization hold up?

Probably not yet. The current MedXpertQA-MM result appears to be a zero-image / text-only run, and the score is near the 20% random baseline for a 5-way MCQ task.

Why are SLAKE / VQA-RAD lower?

That part is plausible. SLAKE and VQA-RAD stress short-answer visual grounding, modality/anatomy recognition, yes/no calibration, and answer normalization. A small model can lag there even if it becomes better at medical wording or MCQ-style output.

What should happen next?

Fix image → images.
Make missing images a hard error.
Treat 21.05% as a text-only baseline.
Rerun MedXpertQA-MM as image+text.
Add text-only, shuffled-image, and options-only ablations.
Run the base 2B model.
Run MedGemma in the same harness.
Report confidence intervals and paired deltas.
Break results down by task, body system, question type, open/closed type, and yes/no type.
Update the model card with conservative wording.

Short summary

The project direction is good.
The current MedXpertQA-MM headline is not yet supported.
The evaluation appears to read an `image` field, but the dataset uses `images`.
The printed result shows every MedXpertQA-MM example under 0 image(s).
Therefore, 21.05% is best treated as a text-only / zero-image baseline, not a multimodal score.
SLAKE / VQA-RAD gaps are plausible because those benchmarks test short-answer visual grounding.
The next step is a strict rerun with image validation, ablations, same-harness MedGemma comparison, and confidence intervals.
