[Guide] How I debugged T5 fine-tuning for a medical diagnosis task

LLM Fine-tuning Debugging Guide: Systematic Problem Solving in Practice

A complete walkthrough from the first problem to a working medical LLM


:bullseye: Project Goal

Develop a medical LLM for diagnostic support using T5 fine-tuning.


:clipboard: Initial Situation

Original Code (functional but limited)

import pandas as pd
import transformers
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from datasets import Dataset
from transformers import DataCollatorForSeq2Seq, Trainer, TrainingArguments

data = [
    {"input": "Symptoms: Fever, cough. CRP: 67. Imaging: Infiltrate basal right. What is the most likely diagnosis?", "output": "Pneumonia"},
    {"input": "Symptoms: Dyspnea, left leg swelling. D-Dimer elevated. What is the most likely diagnosis?", "output": "Pulmonary embolism"},
    {"input": "Symptoms: Fatigue, pallor. Hb: low. What is the most likely diagnosis?", "output": "Anemia"},
    {"input": "Symptoms: Chest pain, high troponin, EKG ST-elevation. What is the most likely diagnosis?", "output": "Myocardial infarction"},
    {"input": "Symptoms: Polyuria, polydipsia, blood glucose 320 mg/dl. What is the most likely diagnosis?", "output": "Diabetes mellitus"}
]
data = pd.DataFrame(data)
tokenizer = T5Tokenizer.from_pretrained("t5-small")

def tokenize(example):
    input_enc = tokenizer(example["input"], truncation=True, padding="max_length", max_length=128)
    output_enc = tokenizer(example["output"], truncation=True, padding="max_length", max_length=32)
    input_enc["labels"] = output_enc["input_ids"]
    return input_enc

dataset = Dataset.from_pandas(data)
tokenized_dataset = dataset.map(tokenize)
model = T5ForConditionalGeneration.from_pretrained("t5-small")

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    num_train_epochs=20,
    logging_steps=1,
    save_strategy="no",
    report_to="none"
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)
trainer.train()

def predict(prompt):
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).input_ids
    outputs = model.generate(inputs, max_length=32)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

test_prompt = "Symptoms: Shortness of breath, fever, CRP 90, X-ray: Infiltrate right. What is the most likely diagnosis?"
print("Answer:", predict(test_prompt))

Initial Results (problematic but functional)

  • Output: "Pneumonia. DD: Pneumonia, Pneumonia" (repetitive)
  • Loss: 8.78 → 0.43 (very good)
  • Problem: Repetitive/incorrect differential diagnoses

:police_car_light: Problem Phase 1: Structural Improvement Leads to “True” Bug

Attempt: Implementing Extended Features

Goal: 100 examples, validation split, better output structure
Changes:

  • Dataset expanded to 100 examples
  • Structured DD output: "Diagnosis: X | DD: Y, Z, W"
  • Train/validation split (80/20)
  • as_target_tokenizer() → text_target (replaces the deprecated context manager)
  • Trainer argument tokenizer → processing_class
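
The structured target layout from the list above can be produced and checked with two small helpers. This is only a minimal sketch: `format_target` and `parse_target` are hypothetical names that do not appear in the original code, and the example diagnoses are placeholders.

```python
def format_target(diagnosis, differentials):
    """Build a training target in the 'Diagnosis: X | DD: Y, Z, W' layout."""
    return f"Diagnosis: {diagnosis} | DD: {', '.join(differentials)}"


def parse_target(text):
    """Split a generated string back into (diagnosis, differentials).

    Returns (None, []) when the text does not follow the layout, which makes
    malformed generations such as "True" easy to detect.
    """
    head, sep, tail = text.partition(" | DD: ")
    if not sep or not head.startswith("Diagnosis: "):
        return None, []
    diagnosis = head[len("Diagnosis: "):].strip()
    differentials = [d.strip() for d in tail.split(",") if d.strip()]
    return diagnosis, differentials


target = format_target("Pneumonia", ["Bronchitis", "Pleuritis"])
print(target)
print(parse_target(target))
print(parse_target("True"))
```

Parsing every generated string this way gives a quick count of how many outputs follow the expected layout at all, before judging whether the diagnosis is right.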

Problem: “True” Bug

# Expected: "Pneumonia. DD: Bronchitis, Pleuritis"
# Actual: "True"

Symptoms:

  • All outputs only "True"
  • Model behaves like a binary classifier
  • Missing keys warning: embed_tokens.weight, lm_head.weight

:magnifying_glass_tilted_left: Debugging Phase 1: Systematic Problem Identification

Step 1: Parameter Instability Hypothesis

Observation: Multiple deprecated/new parameters changed simultaneously

  • evaluation_strategy → TypeError (recent transformers releases name this argument eval_strategy)
  • processing_class vs tokenizer
  • text_target vs as_target_tokenizer()

Hypothesis: New parameters are unstable; old parameters work better
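
One way to test this hypothesis instead of guessing is to ask the installed library which argument name it accepts. The sketch below uses only the standard `inspect` module. `first_supported_kwarg` is a hypothetical helper, and `old_api` / `new_api` are stand-ins for two library versions, not real transformers functions.

```python
import inspect


def first_supported_kwarg(callable_obj, candidates):
    """Return the first name in `candidates` that `callable_obj` accepts, else None."""
    params = inspect.signature(callable_obj).parameters
    for name in candidates:
        if name in params:
            return name
    return None


# Stand-ins for an older and a newer library version (illustration only).
def old_api(output_dir, evaluation_strategy="no"):
    return evaluation_strategy


def new_api(output_dir, eval_strategy="no"):
    return eval_strategy


names = ["eval_strategy", "evaluation_strategy"]
print(first_supported_kwarg(old_api, names))
print(first_supported_kwarg(new_api, names))
```

With transformers installed, calling `first_supported_kwarg(TrainingArguments, names)` should report the name the installed release accepts. That call is an assumption and was not run for this guide.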

Step 2: Stepwise Rollback

Strategy: Change one variable at a time

Test 1: as_target_tokenizer() Fix

# Revert to deprecated but functional method
with tokenizer.as_target_tokenizer():
    output_enc = tokenizer(example["output"], ...)

Result: "rmelkinese" (corrupt, but no longer “True”)

Test 2: Original vs Fix Comparison

Result: "rmelkinese" in both variants, so the tokenizer API change is not the cause and the problem lies elsewhere
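
The "one variable at a time" strategy can be made explicit by generating one experiment per changed setting. This is a sketch under assumptions: the configuration keys and the helper `single_change_variants` are hypothetical labels for the changes listed in Step 1, not code from the original notebook.

```python
def single_change_variants(baseline, changes):
    """Yield (changed_key, config) pairs that differ from `baseline` in one key only."""
    for key, new_value in changes.items():
        variant = dict(baseline)  # copy so the baseline stays untouched
        variant[key] = new_value
        yield key, variant


baseline = {
    "target_api": "as_target_tokenizer",
    "trainer_arg": "tokenizer",
    "num_examples": 5,
}
changes = {
    "target_api": "text_target",
    "trainer_arg": "processing_class",
    "num_examples": 100,
}

for key, config in single_change_variants(baseline, changes):
    print(f"changed {key}: {config}")
```

Each variant is then trained and evaluated separately, so a regression can be attributed to exactly one change.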


:broom: Debugging Phase 2: Fresh Environment Strategy

Step 3: Clean Slate Approach

Decision: Fresh notebook, back to the functional baseline

Baseline Test (5 examples, original code):

# Minimal test for root cause isolation:
# data = the original 5 examples (diagnosis only, no DD)

Result: "What is the most likely diagnosis?" (input echo)


:microscope: Debugging Phase 3: Pipeline Diagnosis

Step 4: Labels Debug

Check: Are labels correctly tokenized?

print("Sample tokenized data:")
print(f"Labels: {tokenized_dataset[0]['labels'][:10]}")
print(f"Decoded Labels: {tokenizer.decode(tokenized_dataset[0]['labels'])}")

Result: :white_check_mark: Labels perfect: "Pneumonia</s><pad>..."
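
The decoded labels show that the `<pad>` tokens are part of the label sequence. With `padding="max_length"`, those positions count toward the loss unless they are replaced by -100, the value that PyTorch's cross-entropy loss ignores by default. Whether this contributed to the PAD-only outputs seen later is an assumption, not something verified in this guide. A minimal sketch of the masking step, with the hypothetical helper `mask_padding` and made-up token IDs:

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_padding(label_ids, pad_token_id=0):
    """Replace padding positions in a label sequence with IGNORE_INDEX.

    pad_token_id defaults to 0, which is the T5 pad token ID.
    """
    return [IGNORE_INDEX if token == pad_token_id else token for token in label_ids]


# Hypothetical token IDs: two word pieces, </s> (ID 1), then padding (ID 0).
labels = [12345, 15, 1, 0, 0, 0]
print(mask_padding(labels))
```

Inside the tokenize function this would read `input_enc["labels"] = mask_padding(output_enc["input_ids"], tokenizer.pad_token_id)`.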

Step 5: Attention Mask Debug

Check: Does the attention mechanism work?

inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=128)
print(f"Attention mask: {inputs.attention_mask}")
print(f"Attention mask sum: {inputs.attention_mask[0].sum()}")

Result: :white_check_mark: Attention perfect: 36/36 tokens attended

Step 6: EOS/PAD Token Debug

Check: Is token handling correct?

print(f"PAD token: '{tokenizer.pad_token}' -> ID: {tokenizer.pad_token_id}")
print(f"EOS token: '{tokenizer.eos_token}' -> ID: {tokenizer.eos_token_id}")

Result: :white_check_mark: Token setup correct, but generation produces input echo


:police_car_light: Problem Phase 2: DataCollator Crash

Step 7: Label-Training-Pipeline Debug

Deeper Test: What happens during training?
CRASH:

ValueError: Unable to create tensor... Perhaps your features (`input` in this case) have excessive nesting

Root Cause: String Features in Dataset

Problem: DataCollator cannot tensorize all features

# Columns in tokenized_dataset after map() (schematic, not real output)
#   "input":     string        ❌ DataCollator crash
#   "output":    string        ❌ DataCollator crash
#   "input_ids": list of ints  ✅ OK
#   "labels":    list of ints  ✅ OK

Fix: Remove String Features

tokenized_dataset = tokenized_dataset.remove_columns(["input", "output"])
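
Instead of hard-coding the column names, the columns to drop can be derived from what the model actually consumes. A minimal sketch, assuming the model inputs are `input_ids`, `attention_mask`, and `labels`. `columns_to_drop` is a hypothetical helper.

```python
MODEL_INPUTS = {"input_ids", "attention_mask", "labels"}


def columns_to_drop(column_names, keep=MODEL_INPUTS):
    """Return every column that is not a model input, preserving order."""
    return [name for name in column_names if name not in keep]


columns = ["input", "output", "input_ids", "attention_mask", "labels"]
print(columns_to_drop(columns))  # ['input', 'output']
```

With a real dataset this would be `tokenized_dataset.remove_columns(columns_to_drop(tokenized_dataset.column_names))`. An alternative is to pass `remove_columns=dataset.column_names` to `dataset.map(...)`, so the raw text columns never reach the collator.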

Result: Training runs, but output still incorrect


:magnifying_glass_tilted_left: Debugging Phase 4: T5-Specific Problems

Step 8: T5 Training Mode Check

Check: Does T5 understand our task?
Discovery: T5 has task-specific parameters:

model.config.task_specific_params = {
    'summarization': {'prefix': 'summarize: '},
    'translation_en_to_de': {'prefix': 'translate English to German: '},
    ...
}

Problem: The checkpoint was pretrained with task prefixes like these, so without a prefix T5 gets no explicit cue about what to do!

Step 9: Task Prefix Implementation

def tokenize_with_task_prefix(example):
    task_prefixed_input = f"medical diagnosis: {example['input']}"
    input_enc = tokenizer(task_prefixed_input, truncation=True, padding="max_length", max_length=128)
    output_enc = tokenizer(example["output"], truncation=True, padding="max_length", max_length=32)
    input_enc["labels"] = output_enc["input_ids"]
    return input_enc

Result: Input echo stops, but only empty outputs
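
The prefix only helps if training and inference use exactly the same string. A small shared helper avoids drift between the two code paths. This is a sketch: `with_task_prefix` is a hypothetical name, and the prefix text is the one used above.

```python
TASK_PREFIX = "medical diagnosis: "


def with_task_prefix(text, prefix=TASK_PREFIX):
    """Prepend the task prefix unless the text already starts with it."""
    return text if text.startswith(prefix) else prefix + text


prompt = "Symptoms: Fever, cough. What is the most likely diagnosis?"
print(with_task_prefix(prompt))
print(with_task_prefix(with_task_prefix(prompt)))  # applying it twice changes nothing
```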


:police_car_light: Problem Phase 3: PAD Token Loop

Step 10: Generation Mechanism Debug

Problem: Model generates only PAD tokens [0,0,0,...]
Deep Debug:

# Raw token analysis
outputs = model.generate(inputs, max_length=32, do_sample=False)
print(f"Raw tokens: {outputs[0]}")
# Result: [0, 0, 0, 0, 0, 0, ...]
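
When reading raw token IDs, note that T5 uses the pad token (ID 0) as the decoder start token, so the first 0 in any generated sequence is expected. Only the tokens after it show whether the output is really empty. A minimal sketch with the hypothetical helper `content_tokens` and made-up token IDs:

```python
def content_tokens(token_ids, pad_id=0, eos_id=1):
    """Return the generated tokens without start token, EOS, and padding."""
    body = list(token_ids)
    if body and body[0] == pad_id:
        body = body[1:]  # drop the decoder start token
    if eos_id in body:
        body = body[: body.index(eos_id)]  # keep only what precedes the first EOS
    return [token for token in body if token != pad_id]


print(content_tokens([0, 0, 0, 0, 0, 0]))      # truly empty generation
print(content_tokens([0, 12345, 15, 1, 0, 0]))  # two content tokens
```

With a real model, `content_tokens(outputs[0].tolist())` returning an empty list confirms that the decoder produced nothing but padding.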

Hypothesis: Training Volume vs Decoder Mechanism

Discussion:

  • Are 10 epochs too few for task prefix learning?
  • Or is the decoder-start mechanism broken?

Step 11: A/B Test Strategy

Test 1: Continue Training (+20 epochs)
Test 2: Fresh Training (30 epochs from scratch)

Continue Training Result:

  • Loss: 2.0 → 0.15-0.30
  • Output: "Morbus Morbus Morbus..." :white_check_mark: (medical terms, but repetitive)

Fresh Training Result:

  • Loss: 10.1 → 0.30-0.85
  • Output: "" (empty, PAD tokens)

Conclusion: In this experiment, continued training worked better than fresh training!


:bullseye: Breakthrough Phase: Generation Parameter Optimization

Step 12: Improved Generation Parameters

Problem: Repetitive output ("Morbus Morbus Morbus...")
Solution: Advanced generation parameters

def predict_improved(prompt):
    prefixed_prompt = f"medical diagnosis: {prompt}"
    inputs = tokenizer(prefixed_prompt, return_tensors="pt", padding=True, truncation=True)

    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=32,
        repetition_penalty=2.0,    # ← Penalize tokens that were already generated
        num_beams=4,               # ← Beam search with 4 beams
        early_stopping=True,       # ← End beam search once num_beams finished candidates exist
        eos_token_id=tokenizer.eos_token_id
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
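
To see why `repetition_penalty=2.0` suppresses loops like "Morbus Morbus Morbus", it helps to look at what the penalty does to the scores. As far as I understand the Hugging Face implementation, the score of every already generated token is divided by the penalty if it is positive and multiplied by it if it is negative. The pure-Python sketch below illustrates that rule on a toy three-token vocabulary. `apply_repetition_penalty` is a hypothetical stand-in, not the library function.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the score of every token ID that was already generated."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive scores shrink, negative scores move further from zero.
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


logits = [2.0, -1.0, 0.5]  # toy scores for token IDs 0, 1, 2
print(apply_repetition_penalty(logits, generated_ids=[0, 1], penalty=2.0))
# [1.0, -2.0, 0.5]
```

A penalty of 2.0 is fairly strong. For short answers such as a single diagnosis it mostly prevents loops, but for longer outputs it may also suppress legitimate repeats.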

Breakthrough Results:

  • Input: "Symptoms: Shortness of breath, fever, CRP 90..."
  • Output: "Shortness of breath, fever, CRP 90, X-ray" :white_check_mark:

Analysis: Model extracts relevant medical information, but no diagnosis yet!

:rocket: Final Success Phase: Scale & Training Optimization

Step 13: Dataset & Training Scale-Up

Strategy: More data + intensive training
Scaling:

  • 25 → 160 examples (6.4x the data)
  • 30 → 40 epochs (more training)
  • 19 medical specialties covered

Optimized Training Parameters:

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,  # Larger batches
    num_train_epochs=40,            # More epochs
    learning_rate=3e-4,             # Optimized LR
    warmup_steps=50,                # Warmup for stability
    logging_steps=10,
    save_strategy="no",
    report_to="none"
)
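
How much more training these settings mean becomes clearer when counted in optimizer steps. The sketch below compares the first run (5 examples, batch size 2, 20 epochs) with this configuration. It assumes a single device, no gradient accumulation, and that the last partial batch is kept. `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples, batch_size, epochs):
    """Number of parameter updates for a full training run."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


first_run = optimizer_steps(num_examples=5, batch_size=2, epochs=20)
final_run = optimizer_steps(num_examples=160, batch_size=4, epochs=40)

print(first_run)        # 60
print(final_run)        # 1600
print(50 / final_run)   # 0.03125, the share of steps spent in warmup
```

The final run also raises the learning rate from the TrainingArguments default of 5e-5 to 3e-4, so the scale-up changes more than the dataset size.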

Final Training Results:

  • Loss: 9.9 → 0.009 (Outstanding!)
  • 160 examples successfully trained
  • 40 epochs with perfect convergence

:trophy: SUCCESS: Functional Medical LLM

Final Test Results (100% Success Rate):

| Test | Input | Generated | Expected | Status |
|------|-------|-----------|----------|--------|
| 1 | Fever, cough, infiltrate | Pneumonia | Pneumonia | :white_check_mark: |
| 2 | Chest pain, troponin, ST-elevation | Myocardial infarction | Myocardial infarction | :white_check_mark: |
| 3 | Polyuria, blood glucose 320 mg/dl | Diabetes mellitus | Diabetes mellitus | :white_check_mark: |
| 4 | Tremor, rigidity, bradykinesia | Parkinson’s disease | Parkinson’s disease | :white_check_mark: |
| 5 | Headache, meningismus | Meningitis | Meningitis | :white_check_mark: |
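
The success rate in the table is an exact-match comparison between generated and expected strings. A minimal sketch of that metric, using the five rows above as data. `exact_match_accuracy` is a hypothetical helper, and ignoring case and surrounding whitespace is my assumption about what should count as a match.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference, ignoring case and outer spaces."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return matches / len(references)


expected = ["Pneumonia", "Myocardial infarction", "Diabetes mellitus",
            "Parkinson's disease", "Meningitis"]
generated = list(expected)  # in the table, every generated answer equals the expected one

print(exact_match_accuracy(generated, expected))  # 1.0
```

With only five cases, a single wrong answer would already drop the score to 0.8, so the number says little about performance on unseen inputs.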

:clipboard: Debugging Steps Summary

:magnifying_glass_tilted_left: Systematic Problem Identification

  1. Parameter Instability Analysis
    • Cross-pattern recognition between various deprecated warnings
    • Isolation of individual parameter changes
  2. Pipeline Component Test
    • Labels tokenization :white_check_mark:
    • Attention mask :white_check_mark:
    • EOS/PAD token handling :white_check_mark:
    • DataCollator :cross_mark: → FIXED
  3. T5-Specific Requirements
    • Task prefix requirement identified
    • Encoder-decoder pipeline understood
  4. Generation Mechanism Optimization
    • Parameter tuning for anti-repetition
    • Beam search for better quality
  5. Scale & Training Optimization
    • Dataset size as a critical factor
    • Training volume for complex tasks

:hammer_and_wrench: Final Code Solution

import pandas as pd
import transformers
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from datasets import Dataset
from transformers import DataCollatorForSeq2Seq, Trainer, TrainingArguments

# LARGE DATABASE: 160 medical examples
data = [
    # ... [160 examples from 19 specialties]
]

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# T5 TASK PREFIX (critical for T5 performance)
def tokenize_with_task_prefix(example):
    task_prefixed_input = f"medical diagnosis: {example['input']}"
    input_enc = tokenizer(task_prefixed_input, truncation=True, padding="max_length", max_length=128)
    output_enc = tokenizer(example["output"], truncation=True, padding="max_length", max_length=32)
    input_enc["labels"] = output_enc["input_ids"]
    return input_enc

dataset = Dataset.from_pandas(pd.DataFrame(data))  # from_pandas expects a DataFrame, not a list
tokenized_dataset = dataset.map(tokenize_with_task_prefix)

# DATACOLLATOR FIX: Remove string features
tokenized_dataset = tokenized_dataset.remove_columns(["input", "output"])

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# OPTIMIZED TRAINING PARAMETERS
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    num_train_epochs=40,
    learning_rate=3e-4,
    warmup_steps=50,
    logging_steps=10,
    save_strategy="no",
    report_to="none"
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,  # recent transformers releases rename this argument to processing_class
    data_collator=data_collator
)
trainer.train()

# OPTIMIZED PREDICTION FUNCTION
def predict_medical_diagnosis(prompt):
    prefixed_prompt = f"medical diagnosis: {prompt}"
    inputs = tokenizer(prefixed_prompt, return_tensors="pt", padding=True, truncation=True)

    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=32,
        repetition_penalty=2.0,    # Penalize tokens that were already generated
        num_beams=4,               # Beam search with 4 beams
        early_stopping=True,       # End beam search once num_beams finished candidates exist
        eos_token_id=tokenizer.eos_token_id
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# TEST
test_prompt = "Symptoms: Shortness of breath, fever, CRP 90, X-ray: Infiltrate right. What is the most likely diagnosis?"
result = predict_medical_diagnosis(test_prompt)
print(f"Diagnosis: {result}")  # Output: "Pneumonia"

:bar_chart: Critical Success Factors

:white_check_mark: Must-Have Components:

  1. T5 Task Prefix: "medical diagnosis: " - Essential for T5 understanding
  2. DataCollator Fix: Remove string features
  3. Sufficient Data: 100+ examples for complex mappings
  4. Advanced Generation: Repetition penalty, beam search, early stopping
  5. Training Volume: 40+ epochs for task learning

:cross_mark: Common Pitfalls:

  1. Deprecated Parameters: New APIs not always more stable
  2. Fresh vs Continue: Continue training can be better than fresh
  3. Cache/Memory Issues: Fresh environment solves many problems
  4. Generation Parameters: Default parameters often insufficient
  5. Dataset Size: Datasets that are too small lead to overfitting/repetition

:brain: Debugging Strategies (Lessons Learned)

1. Systematic Isolation

  • Change one variable at a time
  • Start from a working baseline
  • Forward debugging, not backward guessing

2. Pipeline-Oriented Diagnosis

Input → Tokenization → Attention → Training → Generation → Output
   ✅        ✅           ✅         ❌         ❌        ❌

Test each step individually
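
The stage-by-stage idea can be turned into a tiny runner that executes one check per pipeline stage and reports the first one that fails. This is a sketch: `first_failing_stage` is a hypothetical helper, and the lambdas are stand-ins for real checks such as decoding the labels or inspecting the attention mask.

```python
def first_failing_stage(checks):
    """Run (name, check) pairs in order and return the first failing stage name.

    A check fails when it returns a falsy value or raises an exception.
    Returns None when every stage passes.
    """
    for name, check in checks:
        try:
            passed = bool(check())
        except Exception:
            passed = False
        if not passed:
            return name
    return None


# Stand-in checks that mirror the diagram above.
checks = [
    ("tokenization", lambda: True),
    ("attention", lambda: True),
    ("training", lambda: False),
    ("generation", lambda: True),
]
print(first_failing_stage(checks))  # training
```

Stopping at the first failure keeps attention on the earliest broken stage, since later stages cannot be trusted until it is fixed.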

3. Fresh Environment as a Debugging Tool

  • Eliminate cache/memory issues
  • Enable clean state for reproducible tests
  • Allow controlled experiments

4. Recognize Parameter Instability

  • Take deprecated warnings seriously
  • Cross-pattern recognition between different errors
  • Choose conservative parameters when in doubt

5. Understand Model-Specific Requirements

  • T5 needs task prefix for new tasks
  • Encoder-decoder models have special requirements
  • Generation parameters are critical for output quality

:bullseye: Final Insights

What Worked:

  1. Emergency medicine debugging principles → ML engineering
  2. Systematic differential diagnosis → Bug isolation
  3. “Better safe than sorry” → Conservative development
  4. Fresh environment strategy → Clean testing
  5. Cross-pattern recognition → Root cause analysis

Performance Metrics:

  • Training Loss: 9.9 → 0.009 (99.9% improvement)
  • Test Accuracy: 100% on 5 different medical cases
  • Specialty Coverage: 19 medical specialties
  • Debugging Time: ~3 hours of systematic analysis

:rocket: Next Development Steps

Possible Extensions:

  1. Add differential diagnoses
  2. Implement confidence scoring
  3. Validation set for overfitting prevention
  4. Larger model (T5-base/large) for complex cases
  5. Real-world medical data integration
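
For the validation set, a reproducible 80/20 split is enough to start with. The sketch below is a plain-Python version with the hypothetical helper `train_validation_split`. With a datasets `Dataset`, the built-in `train_test_split(test_size=0.2, seed=42)` does the same job.

```python
import random


def train_validation_split(examples, validation_fraction=0.2, seed=42):
    """Shuffle a copy of `examples` and split it into (train, validation)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


examples = [{"id": i} for i in range(160)]  # placeholder for the 160 training examples
train, validation = train_validation_split(examples)
print(len(train), len(validation))  # 128 32
```

Comparing training loss with validation loss on the held-out part shows whether a very low training loss reflects learning or memorization.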

Deployment Considerations:

  • Model versioning for different specialties
  • API wrapper for clinical integration
  • Safety measures for medical applications
  • Continuous learning from new cases

:light_bulb: Key Takeaways for ML Engineering

1. Debugging is a systematic process

Don’t guess, test methodically

2. Domain Knowledge + Technical Skills = Success

Medical expertise + ML engineering = Powerful combination

3. Fresh Environment is a Powerful Tool

“Turn it off and on again” works for ML too

4. Conservative Parameter Choice Pays Off

Old, stable parameters > new, unstable parameters

5. Model-Specific Requirements are Critical

T5, BERT, GPT each have different best practices


:trophy: Project Success

From a non-functional “True” bug to a model that answered all 5 test cases correctly, in systematic debugging steps.

Proof: Systematic approach + domain expertise + technical implementation = Successful ML solution

Final loss: 0.009; accuracy on our small, hand-crafted test set was 100%, likely due to the limited dataset and clear-cut labels (e.g., Pneumonia, Myocardial infarction). This should not be interpreted as clinical performance.


This document shows how real ML problems are solved in practice: not by luck or intuition, but by systematic analysis, methodical testing, and step-by-step problem solving.

For more info, check out my GitHub repos: KatharinaJacoby (Katharina) on GitHub.