It appears that the performance of the multimodal component may not have been evaluated:
Methodology review: MadriMed-VL-2B vs MedGemma 4B on MedXpertQA-MM, SLAKE, and VQA-RAD
This is a promising project, but I would be careful with the current headline.
The compact-medical-VLM direction is genuinely interesting: fine-tuning a 2B vision-language model locally, making it usable on consumer hardware, and testing it on medical VQA / medical reasoning tasks is a worthwhile experiment. The useful contribution could be: a reproducible small-model medical VLM workflow that runs locally and is evaluated honestly.
But the current benchmark claim:
MadriMed-VL-2B beats Google’s MedGemma 4B on MedXpertQA-MM.
is not methodologically safe yet.
The short version is:
The reported 21.05% MedXpertQA-MM score appears to be a text-only / zero-image multiple-choice run, not a confirmed multimodal MedXpertQA-MM score.
That does not mean the model or project is bad. It means the benchmark result needs to be reclassified and rerun with stricter multimodal validation.
Why I would not treat the current MedXpertQA-MM score as a confirmed multimodal result
The key issue is an image-field mismatch.
The MedXpertQA-MM evaluation loop appears to read:
image_list = row.get("image", [])
But MedXpertQA-MM uses an images field for the multimodal image filenames, not a singular image field.
This matters because row.get("image", []) does not fail loudly. It silently returns an empty list when the field is missing. Then the generation function can continue as a text-only run.
That is exactly what the printed result suggests:
MedXpertQA-MM Results (n=2000)
Accuracy : 21.05% (421/2000)
Unknown : 0 (0.0%)
Random Baseline: 20.00%
Per-Image-Count Breakdown:
0 image(s): 21.1% (421/2000)
So all 2,000 MedXpertQA-MM examples appear to have been evaluated with 0 loaded images.
That changes the interpretation completely.
Instead of:
MedXpertQA-MM multimodal score: 21.05%.
I would report it as:
MedXpertQA-MM text-only / zero-image baseline: 21.05%.
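The silent fallback described above is easy to reproduce. The following is a minimal sketch using a hypothetical row (the id, question text, and filename are made up for illustration); it only shows how lenient dictionary access hides a wrong field name while strict indexing surfaces it.

```python
# Hypothetical row that mimics the schema described above: the image
# filenames live under "images" (plural), not "image".
row = {"id": "example-0", "question": "placeholder question", "images": ["example-0-a.jpeg"]}

# Lenient access: the key is wrong, but no error is raised and the
# default (an empty list) is returned instead.
lenient = row.get("image", [])

# Strict access: the same mistake raises KeyError immediately.
try:
    row["image"]
    strict_error = None
except KeyError as exc:
    strict_error = exc

print("lenient result:", lenient)
print("strict error:", repr(strict_error))
```

With the lenient form, every downstream step sees a valid but empty image list, which is consistent with the 0 image(s) breakdown in the printed result.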
Why this matters statistically
MedXpertQA-MM has 5 answer choices per question. A random baseline is therefore about:
20%
The reported score is:
421 / 2000 = 21.05%
That is only 1.05 percentage points above random.
A rough 95% binomial confidence interval for 421 correct out of 2,000 is approximately:
19.3% to 22.8%
That interval includes 20%. So even before the image-loading issue, 21.05% is not strong evidence of meaningful MedXpertQA-MM generalization.
With the image-loading issue, the safer interpretation is:
The current result is a near-random text-only MCQ baseline, not evidence that the model beat MedGemma on multimodal MedXpertQA-MM.
This is also why I would avoid saying “beats MedGemma” until the corrected multimodal evaluation is rerun.
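The interval quoted above can be checked with the standard library alone. This is a small sketch, not part of the original notebook: the function names are illustrative, and it computes both the normal-approximation interval and the Wilson interval for 421 correct out of 2,000.

```python
import math

def normal_interval(correct, total, z=1.96):
    """Normal-approximation (Wald) interval for a binomial proportion."""
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p - half, p + half

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

for name, fn in [("normal", normal_interval), ("wilson", wilson_interval)]:
    lo, hi = fn(421, 2000)
    print(f"{name}: {lo:.3f} to {hi:.3f} (contains 0.20: {lo <= 0.20 <= hi})")
```

Both intervals contain the 20% random baseline, which is the point of the argument above.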
Why the SLAKE / VQA-RAD gap is still meaningful
The reported SLAKE and VQA-RAD numbers are below MedGemma:
| Benchmark | MadriMed-VL-2B | MedGemma 4B | Gap |
| --- | --- | --- | --- |
| SLAKE | 65.7 | 72.3 | -6.6 |
| VQA-RAD | 43.09 | 49.9 | -6.81 |
| MedXpertQA-MM | 21.05 | 18.8 | +2.25 |
The SLAKE / VQA-RAD gap is plausible and informative.
Those datasets stress short-answer medical visual grounding:
- yes/no calibration;
- modality recognition;
- anatomy / organ recognition;
- abnormality recognition;
- concise answer formatting;
- answer normalization;
- tokenized F1 or exact-match scoring.
A model can become more medically fluent or better at multiple-choice answer selection while still being weaker on short-answer visual grounding.
But the MedXpertQA result probably did not “hold up despite the SLAKE/VQA-RAD gap” yet. The more likely explanation is simpler:
SLAKE and VQA-RAD measured image-question answering, while the current MedXpertQA-MM run likely did not pass images at all.
So I would not explain the MedXpertQA number as successful multimodal generalization yet. I would first fix and rerun the evaluation.
Is the current methodology good?
For an exploratory notebook: yes, it is useful.
For a public benchmark claim against MedGemma: not yet.
The main methodological problems are:
| Issue | Why it matters |
| --- | --- |
| image vs images mismatch | Converts MedXpertQA-MM into a zero-image run |
| Silent fallback to [] | Hides dataset-schema bugs |
| Printed result shows 0 image(s) for all examples | Confirms the multimodal path probably failed |
| 21.05% is near the 20% random baseline | Weak statistical evidence |
| Local score compared to MedGemma model-card score | Not a same-harness comparison |
| Different benchmark metrics | SLAKE/VQA-RAD are often tokenized-F1 style; MedXpertQA is accuracy |
| No text-only / shuffled-image ablations | Cannot prove image use |
| No confidence intervals / paired test | Cannot judge whether a small delta is meaningful |
| No per-slice analysis | Cannot identify what actually improved |
| No leakage checks described | Medical VQA datasets are small and reused |
The strongest version of the work would fix these issues and then report results more conservatively.
What I would do next
1. Reclassify the current 21.05% result
I would change the result table from this:
| Benchmark | MadriMed-VL-2B | MedGemma 4B |
| --- | --- | --- |
| MedXpertQA-MM | 21.05 | 18.8 |
to this:
| Benchmark | Mode | MadriMed-VL-2B | Interpretation |
| --- | --- | --- | --- |
| MedXpertQA-MM | Text-only / zero-image run | 21.05 | Near-random 5-way MCQ baseline |
| MedXpertQA-MM | Correct image + text | pending | True multimodal rerun |
| MedXpertQA-MM | Shuffled image + text | pending | Image-use control |
| MedXpertQA-MM | Options only | pending | Choice-prior control |
| SLAKE | Image + text | 65.7 | Short-answer medical VQA |
| VQA-RAD | Image + text | 43.09 | Radiology VQA |
That framing is much safer.
2. Make image loading strict
For MedXpertQA-MM, missing images should be a hard error.
Do not use:
image_list = row.get("image", [])
Use something stricter:
    from pathlib import Path
    from PIL import Image

    def load_medxpertqa_mm_example(row, image_root: Path):
        # Fail loudly if the dataset schema is not what we expect.
        required = ["id", "question", "options", "label", "images"]
        missing = [k for k in required if k not in row]
        if missing:
            raise KeyError(f"Missing required fields: {missing}. Available keys: {list(row.keys())}")

        # Accept a single filename, but normalize it to a list.
        image_list = row["images"]
        if isinstance(image_list, str):
            image_list = [image_list]
        if not isinstance(image_list, list):
            raise TypeError(f"Expected row['images'] to be a list, got {type(image_list)}")
        if len(image_list) == 0:
            raise ValueError(f"MedXpertQA-MM row has no image filenames: id={row['id']}")

        # Every listed file must exist and open as an image.
        images = []
        for filename in image_list:
            path = image_root / filename
            if not path.exists():
                raise FileNotFoundError(f"Missing image file: {path}")
            images.append(Image.open(path).convert("RGB"))

        # Defensive check: one loaded image per filename.
        if len(images) != len(image_list):
            raise RuntimeError(
                f"Image count mismatch for id={row['id']}: "
                f"filenames={len(image_list)}, loaded={len(images)}"
            )

        return {
            "id": row["id"],
            "question": row["question"],
            "options": row["options"],
            "label": row["label"].strip().upper(),
            "image_filenames": image_list,
            "images": images,
            # Optional metadata used later for per-slice analysis.
            "medical_task": row.get("medical_task"),
            "body_system": row.get("body_system"),
            "question_type": row.get("question_type"),
        }
Before running inference, print image telemetry:
    from collections import Counter

    # `dataset` and `IMAGE_DIR` are assumed to be defined earlier in the notebook.
    filename_counts = Counter()
    loaded_counts = Counter()
    bad_rows = []

    for row in dataset:
        try:
            ex = load_medxpertqa_mm_example(row, IMAGE_DIR)
            filename_counts[len(ex["image_filenames"])] += 1
            loaded_counts[len(ex["images"])] += 1
        except Exception as e:
            # Collect failures so they can all be inspected at once.
            bad_rows.append({"id": row.get("id"), "error": repr(e)})

    print("Rows:", len(dataset))
    print("Image filename counts:", filename_counts)
    print("Loaded image counts:", loaded_counts)
    print("Bad rows:", len(bad_rows))

    # Refuse to continue unless every row loaded at least one image.
    assert len(bad_rows) == 0
    assert loaded_counts[0] == 0
    assert sum(k * v for k, v in loaded_counts.items()) > 0
Every multimodal result should include something like:
| Run | Rows | Rows with image filenames | Rows with loaded images | Zero-image rows |
| --- | --- | --- | --- | --- |
| MedXpertQA-MM image+text | 2000 | 2000 | 2000 | 0 |
If Zero-image rows = 2000, it is not a multimodal evaluation.
3. Add required MedXpertQA-MM ablations
For MedXpertQA-MM, one score is not enough.
I would run at least four modes:
| Mode | Images | Text | Purpose |
| --- | --- | --- | --- |
| Text only | none | question + choices | Measures vignette / option-prior reasoning |
| Correct image + text | correct images | question + choices | Intended multimodal evaluation |
| Shuffled image + text | wrong images | question + choices | Tests whether correct images matter |
| Options only | none | choices only | Tests answer-choice / label-position priors |
Optional fifth mode:
| Mode | Images | Text | Purpose |
| --- | --- | --- | --- |
| Image + options only | correct images | minimal question + choices | Tests visual contribution without full vignette |
How to interpret:
| Pattern | Interpretation |
| --- | --- |
| Correct image+text > text-only | Images likely help |
| Correct image+text > shuffled-image | Correct images matter |
| Shuffled-image ≈ correct image+text | Model may not use image content |
| Text-only ≈ image+text | Benchmark or model is text-dominant |
| Options-only above random | Answer-choice priors exist |
| All modes near random | Model is not solving the benchmark |
The key evidence for a real multimodal gain is:
correct image + text > text only
correct image + text > shuffled image + text
Without that, I would not claim MedXpertQA-MM multimodal generalization.
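One way to build the shuffled-image control is to let each example borrow the images of a different example. The sketch below is an assumption about how that could be done, not the notebook's code: `shuffled_image_donors` is a hypothetical helper that assumes example ids are unique and returns a mapping in which no example is ever paired with itself.

```python
import random

def shuffled_image_donors(example_ids, seed=0):
    """Map each example id to a *different* example id whose images it borrows.

    Hypothetical helper: ids are shuffled into a cycle and each id borrows
    from its successor, so self-pairing is impossible for two or more ids.
    """
    ids = list(example_ids)
    if len(ids) < 2:
        raise ValueError("Need at least two examples to shuffle images")
    if len(set(ids)) != len(ids):
        raise ValueError("Example ids must be unique")
    rng = random.Random(seed)
    rng.shuffle(ids)
    return {ids[i]: ids[(i + 1) % len(ids)] for i in range(len(ids))}

donors = shuffled_image_donors(range(10), seed=1)
print(donors)
```

A fixed seed keeps the control reproducible across models. Note that this guarantees a different example, not necessarily a different image file; if two examples can share the same image, that case would need an extra check.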
4. Compare against the base 2B model first
Before comparing to MedGemma, compare MadriMed-VL-2B to its own base model.
The most important question is:
Did fine-tuning actually improve the 2B model?
Use a table like this:
| Model | MedX text-only | MedX image+text | MedX shuffled | SLAKE | VQA-RAD |
| --- | --- | --- | --- | --- | --- |
| Base 2B VLM | | | | | |
| MadriMed-VL-2B | | | | | |
| Delta | | | | | |
Possible interpretations:
| Pattern | Meaning |
| --- | --- |
| Fine-tuned model improves text-only only | Better medical language / MCQ prior |
| Fine-tuned model improves image+text over shuffled | Better visual grounding |
| Fine-tuned model improves SLAKE/VQA-RAD | Better short-answer medical VQA |
| Fine-tuned model improves MCQ but not SLAKE/VQA-RAD | Better answer-choice reasoning, weaker visual grounding |
| No improvement over base | Fine-tune may not be effective |
A strong result does not have to beat MedGemma immediately. A strong result can be:
Fine-tuning improves a compact 2B VLM substantially over its base model, while remaining runnable locally.
5. Compare MedGemma only in the same harness
Do not make the main claim by comparing a local notebook score to a model-card score.
For a fair MedGemma comparison, run both models through the same evaluation harness:
| Component | Requirement |
| --- | --- |
| Dataset revision | Same for all models |
| Split | Same for all models |
| Image files | Same files |
| Image loader | Same strict loader |
| Prompt | Same task prompt, with only model-specific chat-template adaptation |
| Decoding | Same deterministic settings |
| Answer extractor | Same |
| Metric | Same |
| Unknown handling | Same |
| Logs | Same JSONL schema |
A fair comparison table:
| Model | Mode | Accuracy | 95% CI | Unknown | Same harness |
| --- | --- | --- | --- | --- | --- |
| Base 2B | Image+text | | | | yes |
| MadriMed-VL-2B | Image+text | | | | yes |
| MedGemma 4B | Image+text | | | | yes |
Only after this should you say whether MadriMed beats MedGemma.
Prompting suggestions
MedXpertQA-MM prompt
Use a strict MCQ prompt and short deterministic generation:
You are answering a medical multiple-choice question.
Use the clinical information and all provided images.
Question:
<question>
Answer choices:
A. <option_a>
B. <option_b>
C. <option_c>
D. <option_d>
E. <option_e>
Return only one letter: A, B, C, D, or E.
Final answer:
Recommended generation:
    generate_kwargs = {
        "max_new_tokens": 8,
        "do_sample": False,
    }
If the task only needs a letter, max_new_tokens=512 is unnecessarily long and can increase answer-extraction noise.
SLAKE / VQA-RAD prompt
Use a short-answer prompt:
Answer the medical image question using a short answer.
For yes/no questions, answer only yes or no.
For modality questions, answer only the modality name.
For anatomy questions, answer only the anatomical structure.
Question:
<question>
Answer:
Recommended generation:
    generate_kwargs = {
        "max_new_tokens": 16,
        "do_sample": False,
    }
Why different prompts?
Because MedXpertQA-MM is a multiple-choice reasoning task, while SLAKE and VQA-RAD are short-answer VQA tasks. A prompt that helps one can hurt the other.
Answer extraction suggestions
For MedXpertQA-MM:
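    import re

    def extract_mcq_letter(text: str) -> str:
        text = text.strip().upper()
        # Patterns are tried from most to least specific.
        patterns = [
            r"FINAL\s+ANSWER\s*[:\-]?\s*\(?([ABCDE])\)?",
            r"ANSWER\s*[:\-]?\s*\(?([ABCDE])\)?",
            r"^\(?([ABCDE])\)?[\.\)]?$",
            # Last resort: any standalone letter. On free-form text this can
            # also match the article "A", so keep generations short.
            r"\b([ABCDE])\b",
        ]
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                return match.group(1)
        return "UNKNOWN"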
Report:
| Diagnostic | Why |
| --- | --- |
| Accuracy counting unknown as wrong | Conservative score |
| Unknown rate | Output-format compliance |
| Answer distribution | Detects A/B/C/D/E bias |
| Raw outputs | Lets others audit extraction |
| Per-choice accuracy | Detects answer-position artifacts |
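Most of these diagnostics can be computed from the extracted letters and the gold labels. The following is a rough sketch under the assumption that predictions are the strings "A" to "E" or "UNKNOWN"; `mcq_diagnostics` is a hypothetical helper name, and per-choice accuracy is defined here as accuracy among questions whose gold answer is that letter.

```python
from collections import Counter

def mcq_diagnostics(predictions, labels, letters="ABCDE"):
    """Summarize MCQ predictions; "UNKNOWN" predictions count as wrong."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("no examples to score")
    total = len(labels)
    pairs = list(zip(predictions, labels))
    per_choice = {}
    for letter in letters:
        gold_n = sum(1 for _, g in pairs if g == letter)
        hits = sum(1 for p, g in pairs if g == letter and p == letter)
        # None marks letters that never appear as a gold answer.
        per_choice[letter] = hits / gold_n if gold_n else None
    return {
        "accuracy": sum(1 for p, g in pairs if p == g) / total,
        "unknown_rate": sum(1 for p, _ in pairs if p == "UNKNOWN") / total,
        "answer_distribution": dict(Counter(predictions)),
        "per_choice_accuracy": per_choice,
    }

report = mcq_diagnostics(["A", "B", "UNKNOWN", "A"], ["A", "C", "C", "B"])
print(report)
```

Raw model outputs are not covered by this helper; they would still need to be logged separately so the extraction can be audited.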
For SLAKE / VQA-RAD, report all of the following:
- normalized exact match;
- tokenized F1;
- open-ended score;
- closed-ended score;
- yes/no accuracy.
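As a reference point for the first two metrics, here is a minimal sketch of normalized exact match and token-level F1. The normalization (lowercasing, stripping ASCII punctuation, collapsing whitespace) is an assumption; the official SLAKE and VQA-RAD scoring used in published comparisons may differ, so whichever normalization is chosen should be stated alongside the numbers.

```python
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction, gold):
    return normalize_answer(prediction) == normalize_answer(gold)

def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Yes.", "yes"))
print(token_f1("the left lung", "left lung"))
```

Token F1 gives partial credit to verbose answers such as "the left lung", which is one reason a model with loose answer formatting can score differently under the two metrics.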
Add answer-choice rotation for MedXpertQA-MM
Multiple-choice models can exploit option-position bias. Rotate options.
| Variant | Option order |
| --- | --- |
| Original | A B C D E |
| Rotation 1 | B C D E A |
| Rotation 2 | C D E A B |
| Rotation 3 | D E A B C |
| Rotation 4 | E A B C D |
Then map the predicted letter back to the semantic answer.
Report:
| Metric | Meaning |
| --- | --- |
| Original-order accuracy | Standard score |
| Rotation-mean accuracy | More robust score |
| Semantic consistency | Whether the same answer is chosen under rotations |
| Letter bias | Whether the model over-picks A/B/C/D/E |
If accuracy collapses under rotation, the model may be exploiting option position rather than doing robust reasoning.
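The rotation and the mapping back to the semantic answer can be sketched as follows. This assumes the options are available as a dict keyed by the letters A to E (an assumption about the row format); `rotate_options` is a hypothetical helper, and rotation k places the original option at position k first, matching the rotation table above.

```python
LETTERS = "ABCDE"

def rotate_options(options, k):
    """Rotate option texts left by k positions.

    Returns (rotated_options, back_map), where back_map maps a letter shown
    in the rotated question to the letter it had in the original question.
    """
    n = len(LETTERS)
    rotated, back_map = {}, {}
    for i, shown_letter in enumerate(LETTERS):
        original_letter = LETTERS[(i + k) % n]
        rotated[shown_letter] = options[original_letter]
        back_map[shown_letter] = original_letter
    return rotated, back_map

options = {letter: f"option text {letter}" for letter in LETTERS}
rotated, back_map = rotate_options(options, 1)

# If the model answers "E" on the rotated question, the semantic answer
# is whatever original letter now sits in position E.
predicted_shown = "E"
print(back_map[predicted_shown], "->", options[back_map[predicted_shown]])
```

Scoring then compares `back_map[prediction]` with the original gold label, so the gold labels never need to be rewritten.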
Add confidence intervals and paired tests
For accuracy:
    from statsmodels.stats.proportion import proportion_confint

    def wilson_ci(correct, total, alpha=0.05):
        return proportion_confint(correct, total, alpha=alpha, method="wilson")
For model comparisons:
    import numpy as np

    def paired_bootstrap_diff(a_correct, b_correct, n_boot=10000, seed=0):
        # a_correct / b_correct: per-example 0/1 correctness on the same examples.
        rng = np.random.default_rng(seed)
        a_correct = np.asarray(a_correct, dtype=np.float32)
        b_correct = np.asarray(b_correct, dtype=np.float32)
        assert len(a_correct) == len(b_correct)
        n = len(a_correct)
        diffs = []
        for _ in range(n_boot):
            # Resample example indices once and apply them to both models.
            idx = rng.integers(0, n, size=n)
            diffs.append(a_correct[idx].mean() - b_correct[idx].mean())
        # Lower bound, median, and upper bound of the accuracy difference.
        return np.percentile(diffs, [2.5, 50, 97.5])
Report deltas:
| Comparison | Delta | 95% CI | Meaning |
| --- | --- | --- | --- |
| MadriMed image+text − MadriMed text-only | | | Visual contribution |
| MadriMed image+text − MadriMed shuffled-image | | | Correct-image contribution |
| MadriMed image+text − Base 2B image+text | | | Fine-tuning gain |
| MadriMed image+text − MedGemma image+text | | | External comparison |
If the confidence interval crosses zero, do not call it a clear win.
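As a complement to the bootstrap interval, an exact McNemar test is a common paired test for two models scored on the same examples. It is not something the original notebook uses; the sketch below is an assumed, standard-library-only version that looks only at the examples where exactly one model is correct.

```python
from math import comb

def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar test on paired 0/1 correctness.

    Returns (a_only, b_only, p_value), where a_only counts examples that
    only model A got right and b_only counts examples only model B got right.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("Both models must be scored on the same examples")
    a_only = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    b_only = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0
    k = min(a_only, b_only)
    # Under the null, discordant pairs split 50/50: binomial(n, 0.5) tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return a_only, b_only, min(1.0, 2 * tail)

print(mcnemar_exact([1] * 10, [0] * 10))
```

A small p-value here and a bootstrap interval that excludes zero point in the same direction; if they disagree, the comparison is probably too close to call.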
Add per-slice analysis
MedXpertQA includes metadata such as medical task, body system, and question type. Use it.
MedXpertQA-MM slice table
| Slice | n | Text-only | Image+text | Shuffled-image | Image gain |
| --- | --- | --- | --- | --- | --- |
| Diagnosis | | | | | |
| Treatment | | | | | |
| Basic medicine | | | | | |
| Reasoning | | | | | |
| Understanding | | | | | |
| Cardiovascular | | | | | |
| Dermatology-related | | | | | |
| Radiology-heavy | | | | | |
SLAKE / VQA-RAD slice table
| Slice | Score | Why it matters |
| --- | --- | --- |
| Open-ended | | Harder answer normalization |
| Closed-ended | | Often yes/no-heavy |
| Yes/no | | Detects yes/no bias |
| Modality | | Basic visual recognition |
| Organ/body part | | Anatomical grounding |
| Abnormality | | Clinical visual interpretation |
| Location | | Spatial reasoning |
This will show whether the model is weak because it cannot see, cannot reason, cannot answer concisely, or cannot handle a particular modality.
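Filling in these slice tables only needs per-example correctness plus the metadata field to group by. The sketch below assumes each logged record is a dict with a boolean `correct` field and metadata keys such as `medical_task`; both the record layout and the `accuracy_by_slice` name are illustrative assumptions.

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    """Group per-example results by a metadata field and report n and accuracy."""
    totals = defaultdict(lambda: [0, 0])  # slice name -> [correct, total]
    for rec in records:
        # Missing metadata goes into an explicit bucket rather than vanishing.
        name = rec.get(slice_key) or "unlabeled"
        totals[name][0] += int(rec["correct"])
        totals[name][1] += 1
    return {
        name: {"n": total, "accuracy": correct / total}
        for name, (correct, total) in sorted(totals.items())
    }

records = [
    {"medical_task": "Diagnosis", "correct": True},
    {"medical_task": "Diagnosis", "correct": False},
    {"medical_task": "Treatment", "correct": True},
    {"medical_task": None, "correct": False},
]
print(accuracy_by_slice(records, "medical_task"))
```

Running it once per mode (text-only, image+text, shuffled-image) gives the columns of the slice table; the image gain is then the per-slice difference between modes. Reporting n matters because small slices produce noisy accuracies.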
Check for leakage
Medical VQA datasets are small and frequently reused. Leakage checks are important.
Check:
- exact image duplicates between train and test;
- perceptual image duplicates;
- repeated clinical vignettes;
- repeated question/answer pairs;
- synthetic data generated from benchmark examples;
- captions or filenames that reveal labels.
Example image hash check:
    from PIL import Image
    import imagehash

    def phash(path):
        return imagehash.phash(Image.open(path).convert("RGB"))
Then compare training images against MedXpertQA-MM, SLAKE, and VQA-RAD evaluation images.
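For the first item in the checklist, exact duplicates, a byte-level hash is a cheap first pass before perceptual hashing. This sketch is an assumption about how it could be organized: `find_exact_duplicates` is a hypothetical helper that takes two dicts mapping an image name to its raw file bytes, so reading the files from disk is left to the caller.

```python
import hashlib

def find_exact_duplicates(train_images, eval_images):
    """Return (train_name, eval_name) pairs whose file bytes are identical.

    Both arguments map an image name to its raw bytes.
    """
    train_by_digest = {}
    for name, data in train_images.items():
        digest = hashlib.sha256(data).hexdigest()
        train_by_digest.setdefault(digest, []).append(name)

    overlaps = []
    for name, data in eval_images.items():
        digest = hashlib.sha256(data).hexdigest()
        for train_name in train_by_digest.get(digest, []):
            overlaps.append((train_name, name))
    return overlaps

train = {"train_1.png": b"abc", "train_2.png": b"xyz"}
evaluation = {"eval_1.png": b"abc", "eval_2.png": b"different"}
print(find_exact_duplicates(train, evaluation))
```

This only catches byte-identical files. Resized, re-encoded, or cropped copies need the perceptual hash shown above, compared with a distance threshold rather than equality.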
Training suggestions for the next version
Once evaluation is fixed, I would train in stages rather than mixing everything together.
Stage 1 — visual-medical grounding
Goal: teach the model to see medical images.
Examples:
Question: What modality is this?
Answer: CT
Question: Which body region is shown?
Answer: chest
Question: Does the image contain liver?
Answer: no
Question: Is there pleural effusion?
Answer: yes
Train on:
- modality recognition;
- anatomy recognition;
- body-region recognition;
- view / plane recognition;
- presence / absence;
- normal / abnormal.
Stage 2 — short-answer VQA
Goal: improve SLAKE / VQA-RAD.
Examples:
Question: Is there cardiomegaly?
Answer: yes
Question: What organ is shown?
Answer: lung
Question: What imaging modality is used?
Answer: x-ray
Make the answer short and canonical.
Stage 3 — clinical MCQ reasoning
Goal: improve MedXpertQA-style reasoning.
Example:
Question:
<clinical_vignette_plus_image_context>
Answer choices:
A. <option_a>
B. <option_b>
C. <option_c>
D. <option_d>
E. <option_e>
Final answer: C
Stage 4 — mixed replay
Goal: avoid overfitting to one format.
Mix:
- short-answer VQA;
- yes/no;
- modality/anatomy;
- MCQ;
- general VLM instruction samples;
- negative examples.
Fine-tuning target suggestions
Do not only test one fine-tuning configuration.
For a VLM, language-only tuning can improve medical wording without improving image use. Test at least three variants:
| Run | Tuned components | What it tests |
| --- | --- | --- |
| A | Language layers only | Better medical language / answer style |
| B | Projector + language layers | Better image-text alignment |
| C | Late vision blocks + projector + language | Better medical visual adaptation |
Then evaluate:
| Signal | What you want |
| --- | --- |
| Correct-image vs text-only | Images help |
| Correct-image vs shuffled-image | Correct images matter |
| SLAKE/VQA-RAD improvement | Better visual grounding |
| MCQ improvement | Better clinical answer-choice reasoning |
The best run is not necessarily the one with the highest raw score. It is the one that improves the intended
Final verdict
Is this good methodology?
Not yet.
It is a useful exploratory experiment, but not a strong benchmark methodology for claiming a win over MedGemma.
Did MedXpertQA generalization hold up?
Probably not yet. The current MedXpertQA-MM result appears to be a zero-image / text-only run, and the score is near the 20% random baseline for a 5-way MCQ task.
Why are SLAKE / VQA-RAD lower?
That part is plausible. SLAKE and VQA-RAD stress short-answer visual grounding, modality/anatomy recognition, yes/no calibration, and answer normalization. A small model can lag there even if it becomes better at medical wording or MCQ-style output.
What should happen next?
- Fix image → images.
- Make missing images a hard error.
- Treat 21.05% as a text-only baseline.
- Rerun MedXpertQA-MM as image+text.
- Add text-only, shuffled-image, and options-only ablations.
- Run the base 2B model.
- Run MedGemma in the same harness.
- Report confidence intervals and paired deltas.
- Break results down by task, body system, question type, open/closed type, and yes/no type.
- Update the model card with conservative wording.
Short summary
- The project direction is good.
- The current MedXpertQA-MM headline is not yet supported.
- The evaluation appears to load image, but the dataset uses images.
- The printed result shows every MedXpertQA-MM example under 0 image(s).
- Therefore, 21.05% is best treated as a text-only / zero-image baseline, not a multimodal score.
- SLAKE / VQA-RAD gaps are plausible because those benchmarks test short-answer visual grounding.
- The next step is a strict rerun with image validation, ablations, same-harness MedGemma comparison, and confidence intervals.