YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

Curriculum Learning Γ— Temperature Sampling for Multilingual ASR

A literature-grounded framework for finding the optimal data scheduling strategy for multilingual ASR training across three resource tiers.

Problem Statement

Given multilingual ASR data spanning:

  • Low-resource languages: 100-150 hours
  • Mid-resource languages: 500-1500 hours
  • High-resource languages: 3000-8000 hours

Find the optimal combination of temperature-based sampling and curriculum learning that maximizes WER/CER across ALL tiers without sacrificing high-resource performance.

Key Finding: 3-Phase Easy-to-Hard Curriculum

After evaluating 11 strategies across 2000+ configurations with robustness validation across 10 random seeds, the optimal approach is:

Phase 1: HRL Foundation (0% β†’ 25% of training)

  • Languages: HIGH-resource only (3000-8000h)
  • Sampling: Ο„ = 1.0 (proportional within HRL)
  • Purpose: Build robust encoder representations

Phase 2: HRL + MRL Expansion (25% β†’ 55% of training)

  • Languages: HIGH + MID resource (500-8000h)
  • Sampling: Ο„ = 2.0 (moderate upsampling of MRL)
  • Purpose: Extend representations, begin cross-lingual transfer

Phase 3: Full Multilingual (55% β†’ 100%)

  • Languages: ALL (100-8000h)
  • Sampling: Ο„ = 3.0 - 3.33 (upsampling of LRL)
  • Purpose: Train LRL while maintaining HRL+MRL quality
  • Epoch cap: max 5 repetitions per LRL language

Literature Foundations

Paper Key Contribution Year
UniMax Epoch-capped uniform sampling prevents overfitting 2023
Cooldown Dynamic τ: high→low achieves best of both worlds 2024
MMS LSAH adapters eliminate language confusion at scale 2023
Whisper WER halves every 16Γ— data increase (log-log linear) 2022
Google USM MOST curriculum: staged data introduction 2023
Scaling Laws L_i = L*_i Β· p_i^(-Ξ³_i) power law per language family 2024
CL Pretraining Pacing functions, interleaved CL with difficulty metrics 2025

Sensitivity Analysis Results

Phase 3 Temperature (Ο„) β€” Most Impactful Parameter

Ο„ LRL WER MRL WER HRL WER Harmonic Mean
1.0 7.84 3.34 0.57 1.38
2.0 7.51 3.10 0.58 1.38
3.33 7.22 3.05 0.59 1.39
5.0 7.04 3.03 0.60 1.40
10.0 6.84 3.04 0.61 1.41

Key insight: Higher Ο„ monotonically improves LRL but degrades HRL. The harmonic mean optimum is at Ο„β‰ˆ2.0-3.33, confirming the mT5/XLM-R choice.

HRL Warmup Duration β€” Sweet Spot at 25-30%

Warmup LRL WER HRL WER Harmonic Mean
0% 8.54 0.77 1.76
15% 7.72 0.63 1.47
25% 7.22 0.59 1.39
30% 7.03 0.58 1.36
40% 7.53 0.57 1.38
50% 8.08 0.57 1.39

Key insight: Too little warmup β†’ poor HRL. Too much β†’ LRL starved. Optimal at 25-30%.

Files

  • curriculum_temperature_framework.py β€” Main simulation framework (11 strategies, performance model, visualizations)
  • optimal_strategy_deep_analysis.py β€” Hyperparameter sweep (680+ curriculum configs, 192 hybrid configs)
  • final_training_recipe.txt β€” Production-ready training recipe with implementation pseudocode

Overfitting Prevention (Critical for LRL)

From UniMax paper: "Even repeating 0.1% of data 100 times can be as harmful as halving model size"

Tier Max Epochs Data Augmentation Effective Data
Low (100-150h) 5 Speed perturbation 3Γ—, SpecAugment aggressive ~450-750h effective
Mid (500-1500h) 10 SpecAugment moderate ~5000-15000h effective
High (3000-8000h) 2 SpecAugment light ~6000-16000h effective
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Papers for StephennFernandes/multilingual-asr-curriculum-strategy