Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR

Quy-Anh Dang*, Chris Ngo

*Corresponding author · Knovel Engineering Lab, Singapore

$81 training cost · 233× cheaper than MERaLiON · 20× faster inference · 14.85 avg error rate

Abstract

We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. Our models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora, using a balanced sampling strategy that equalises the number of training utterances per language and deliberately omits language-tag conditioning so that the model learns to identify languages implicitly from audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32) — a model 6× larger — while incurring a training cost of $81 on a single RTX PRO 6000 GPU compared to $18,862 for the 128-GPU baseline. Inference throughput is approximately 20× faster at 0.10 s/sample versus 2.02 s/sample.

State-of-the-Art Accuracy at a Fraction of the Size

Polyglot-Lion delivers competitive or best-in-class WER and CER across all four languages, matching or beating models up to 6× its size on most benchmarks.

Average error rate comparison across models. Benchmarks per language: English (WER) = LS, NSC, CV; Mandarin (CER) = AISH1, AISH3, Fleurs; Tamil (WER) = CV, SLR65, SLR127, Fleurs; Malay (WER) = Meso., Fleurs.

| Model | Params | LS | NSC | CV | AISH1 | AISH3 | Fleurs | CV | SLR65 | SLR127 | Fleurs | Meso. | Fleurs | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Whisper-large-v3-turbo | 0.8B | 3.04 | 32.02 | 17.91 | 9.64 | 16.81 | 10.63 | 74.50 | 58.13 | 69.56 | 66.90 | 28.47 | 8.88 | 33.04 |
| SeaLLMs-Audio-7B | 7B | 94.74 | 9.53 | 8.68 | 9.65 | 9.76 | 37.09 | 126.70 | 127.24 | 138.65 | 105.31 | 71.34 | 26.25 | 63.75 |
| Qwen2.5-Omni-3B | 3B | 29.21 | 34.79 | 46.36 | 28.25 | 44.55 | 54.74 | 318.36 | 465.58 | 448.82 | 311.67 | 211.90 | 74.69 | 172.37 |
| Qwen2.5-Omni-7B | 7B | 13.80 | 22.96 | 14.49 | 7.33 | 22.58 | 16.68 | 252.06 | 239.15 | 303.96 | 326.43 | 158.06 | 43.92 | 118.45 |
| Qwen3-ASR-0.6B | 0.6B | 2.74 | 7.64 | 10.06 | 2.08 | 2.59 | 9.75 | 121.10 | 127.00 | 129.12 | 130.09 | 47.29 | 18.71 | 50.68 |
| Qwen3-ASR-1.7B | 1.7B | 2.31 | 6.22 | 7.50 | 1.52 | 2.08 | 9.33 | 139.96 | 134.63 | 144.49 | 147.23 | 39.00 | 10.87 | 53.76 |
| MERaLiON-2-10B-ASR | 10B | 2.54 | 4.62 | 8.83 | 3.09 | 4.07 | 11.99 | 31.78 | 19.29 | 22.42 | 28.68 | 25.90 | 8.55 | 14.32 |
| ★ Polyglot-Lion-0.6B | 0.6B | 2.67 | 6.09 | 6.16 | 1.93 | 2.32 | 9.19 | 42.16 | 23.07 | 28.14 | 37.68 | 24.33 | 14.45 | 16.52 |
| ★ Polyglot-Lion-1.7B | 1.7B | 2.10 | 5.28 | 4.91 | 1.45 | 1.86 | 8.00 | 39.19 | 19.75 | 26.83 | 37.28 | 21.51 | 9.98 | 14.85 |

WER (%) for English, Tamil, and Malay; CER (%) for Mandarin. Lower is better. ★ = our models.
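The Avg column is consistent with an unweighted mean over the 12 benchmark scores. A quick check for Polyglot-Lion-1.7B, with the values transcribed from the table above:

```python
# Per-benchmark error rates for Polyglot-Lion-1.7B, in table order:
# English (LS, NSC, CV), Mandarin (AISH1, AISH3, Fleurs),
# Tamil (CV, SLR65, SLR127, Fleurs), Malay (Meso., Fleurs).
scores = [2.10, 5.28, 4.91, 1.45, 1.86, 8.00,
          39.19, 19.75, 26.83, 37.28, 21.51, 9.98]

avg = sum(scores) / len(scores)
print(avg)  # ≈ 14.85, matching the reported Avg
```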

Two-Stage Balanced Multilingual Upsampling

To handle severe class imbalance across languages and datasets, we introduce a two-stage deterministic upsampling strategy.

🔄 Stage 1 — Intra-Language Balancing

Within each language, every smaller dataset is deterministically upsampled, by replication plus subsampling of the final partial copy, to match the size of the largest dataset in that language group, so all corpora within a language are equally represented.

🌐 Stage 2 — Inter-Language Balancing

Across languages, per-language corpora are balanced so every language contributes exactly 25% to the final training set — guaranteeing equal coverage regardless of data availability.

🔇 Language-Agnostic Decoding

No language tags are used at training or inference time. The model identifies the spoken language implicitly from audio alone — ideal for Singapore's code-switching environment.

📋 Algorithm

[Figure: two-stage balanced upsampling algorithm]
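The two stages can be sketched as follows. This is a minimal illustration, not the paper's implementation: datasets are represented as lists of utterance IDs, the function names are hypothetical, and upsampling is done deterministically by cycling through each corpus until the target size is reached.

```python
import itertools

def balance_intra(datasets: dict) -> dict:
    """Stage 1: within one language, upsample every dataset to the
    size of the largest one by cycling through its utterances."""
    target = max(len(utts) for utts in datasets.values())
    return {
        name: list(itertools.islice(itertools.cycle(utts), target))
        for name, utts in datasets.items()
    }

def balance_inter(languages: dict) -> dict:
    """Stage 2: pool each language's intra-balanced corpora, then
    upsample every language to the size of the largest one, so each
    language contributes an equal share (25% for four languages)."""
    pooled = {
        lang: [u for utts in balance_intra(ds).values() for u in utts]
        for lang, ds in languages.items()
    }
    target = max(len(utts) for utts in pooled.values())
    return {
        lang: list(itertools.islice(itertools.cycle(utts), target))
        for lang, utts in pooled.items()
    }

# Toy example with heavily imbalanced corpora: English has two
# datasets (50 and 10 utterances), Tamil a single one (5 utterances).
corpora = {
    "en": {"LS": ["en"] * 50, "NSC": ["en"] * 10},
    "ta": {"CV": ["ta"] * 5},
}
balanced = balance_inter(corpora)
sizes = {lang: len(utts) for lang, utts in balanced.items()}
print(sizes)  # every language ends up the same size
```

After stage 1, both English datasets contribute 50 utterances each (pool of 100); after stage 2, Tamil is upsampled from 5 to 100, giving each language an equal share of the final mix.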

Training Data — 4 Languages, 12 Datasets, ~969 Hours

[Figure: dataset statistics by language]

Dramatically Lower Cost, Dramatically Higher Speed

💰 Training Cost Comparison

|  | MERaLiON-2-10B | Polyglot-Lion |
|---|---|---|
| Training data | 120,000 h | 783 h |
| Hardware | 128 × H100 | 1 × RTX PRO 6000 |
| Training time | 48 h | 48 h |
| Est. cost | $18,862 | $81 |
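The headline cost ratio follows directly from the table:

```python
meralion_cost = 18_862  # USD, 128 × H100 for 48 h
ours_cost = 81          # USD, 1 × RTX PRO 6000 for 48 h

ratio = meralion_cost / ours_cost
print(round(ratio))  # 233, i.e. ~233× cheaper
```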

⚡ Inference Speed

| Model | Time (s/sample) |
|---|---|
| MERaLiON-2-10B-ASR | 2.0152 ± 0.8846 |
| Qwen2.5-Omni-3B | 1.7838 ± 1.0431 |
| Qwen2.5-Omni-7B | 1.3414 ± 0.6572 |
| SeaLLMs-Audio-7B | 0.6422 ± 0.0000 |
| Whisper-large-v3-turbo | 0.2822 ± 0.0230 |
| Qwen3-ASR-1.7B | 0.0809 ± 0.0290 |
| Qwen3-ASR-0.6B | 0.0686 ± 0.0251 |
| ★ Polyglot-Lion-0.6B | 0.0999 ± 0.0561 |
| ★ Polyglot-Lion-1.7B | 0.1038 ± 0.0621 |
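The roughly 20× speedup quoted above follows from comparing per-sample latency against MERaLiON-2-10B-ASR:

```python
meralion_s = 2.0152  # s/sample, MERaLiON-2-10B-ASR
lion_17b_s = 0.1038  # s/sample, Polyglot-Lion-1.7B
lion_06b_s = 0.0999  # s/sample, Polyglot-Lion-0.6B

print(f"{meralion_s / lion_17b_s:.1f}x")  # 19.4x
print(f"{meralion_s / lion_06b_s:.1f}x")  # 20.2x
```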

Download on 🤗 Hugging Face

Polyglot-Lion-0.6B: 0.6 billion parameters · 16.52 avg error rate · 0.10 s/sample · 🤗 Download Model

Polyglot-Lion-1.7B: 1.7 billion parameters · 14.85 avg error rate · 0.10 s/sample · 🤗 Download Model

Cite Our Work

@misc{dang2026polyglotlion,
    title={Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR}, 
    author={Quy-Anh Dang and Chris Ngo},
    year={2026},
    eprint={2603.16184},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2603.16184}, 
}

Help Us Build Better Multilingual ASR

Polyglot-Lion is an open research effort, and we warmly welcome contributions from the community. Whether you can provide multilingual speech datasets, GPU compute resources, code contributions, or evaluation expertise — we would love to hear from you. Together we can push the boundaries of accessible, efficient multilingual ASR.

✉️ Reach Out to Collaborate