We present Polyglot-Lion, a family of compact multilingual automatic speech
recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English,
Mandarin, Tamil, and Malay. Our models are obtained by fine-tuning Qwen3-ASR-0.6B and
Qwen3-ASR-1.7B exclusively on publicly available speech corpora, using a balanced
sampling strategy that equalises the number of training utterances per language and deliberately
omits language-tag conditioning so that the model learns to identify languages implicitly from
audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B
achieves an average error rate of 14.85,
competitive with MERaLiON-2-10B-ASR (14.32) — a model 6× larger — while incurring a training cost of
$81 on a single RTX PRO 6000 GPU compared to
$18,862 for the 128-GPU baseline. Inference throughput is approximately 20× faster at 0.10 s/sample versus 2.02 s/sample.
Polyglot-Lion achieves competitive or best-in-class WER and CER across all four languages, outperforming models up to 6× larger.
| Model | Params | EN: LS | EN: NSC | EN: CV | ZH: AISH1 | ZH: AISH3 | ZH: Fleurs | TA: CV | TA: SLR65 | TA: SLR127 | TA: Fleurs | MS: Meso. | MS: Fleurs | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Whisper-large-v3-turbo | 0.8B | 3.04 | 32.02 | 17.91 | 9.64 | 16.81 | 10.63 | 74.50 | 58.13 | 69.56 | 66.90 | 28.47 | 8.88 | 33.04 |
| SeaLLMs-Audio-7B | 7B | 94.74 | 9.53 | 8.68 | 9.65 | 9.76 | 37.09 | 126.70 | 127.24 | 138.65 | 105.31 | 71.34 | 26.25 | 63.75 |
| Qwen2.5-Omni-3B | 3B | 29.21 | 34.79 | 46.36 | 28.25 | 44.55 | 54.74 | 318.36 | 465.58 | 448.82 | 311.67 | 211.90 | 74.69 | 172.37 |
| Qwen2.5-Omni-7B | 7B | 13.80 | 22.96 | 14.49 | 7.33 | 22.58 | 16.68 | 252.06 | 239.15 | 303.96 | 326.43 | 158.06 | 43.92 | 118.45 |
| Qwen3-ASR-0.6B | 0.6B | 2.74 | 7.64 | 10.06 | 2.08 | 2.59 | 9.75 | 121.10 | 127.00 | 129.12 | 130.09 | 47.29 | 18.71 | 50.68 |
| Qwen3-ASR-1.7B | 1.7B | 2.31 | 6.22 | 7.50 | 1.52 | 2.08 | 9.33 | 139.96 | 134.63 | 144.49 | 147.23 | 39.00 | 10.87 | 53.76 |
| MERaLiON-2-10B-ASR | 10B | 2.54 | 4.62 | 8.83 | 3.09 | 4.07 | 11.99 | 31.78 | 19.29 | 22.42 | 28.68 | 25.90 | 8.55 | 14.32 |
| ★ Polyglot-Lion-0.6B | 0.6B | 2.67 | 6.09 | 6.16 | 1.93 | 2.32 | 9.19 | 42.16 | 23.07 | 28.14 | 37.68 | 24.33 | 14.45 | 16.52 |
| ★ Polyglot-Lion-1.7B | 1.7B | 2.10 | 5.28 | 4.91 | 1.45 | 1.86 | 8.00 | 39.19 | 19.75 | 26.83 | 37.28 | 21.51 | 9.98 | 14.85 |
WER (%) for English (EN), Tamil (TA), and Malay (MS); CER (%) for Mandarin (ZH). Lower is better. ★ = our models.
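For reference, the two metrics in the table can be computed with a plain Levenshtein edit distance: WER over whitespace tokens, CER over characters. This is a minimal illustrative sketch, not the project's actual scoring script (which may apply additional text normalisation).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, single-row DP."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def wer(ref, hyp):
    """Word error rate in %, over whitespace-separated tokens."""
    ref_tokens = ref.split()
    return 100.0 * edit_distance(ref_tokens, hyp.split()) / len(ref_tokens)

def cer(ref, hyp):
    """Character error rate in %, spaces removed before comparison."""
    ref_chars = list(ref.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, list(hyp.replace(" ", ""))) / len(ref_chars)
```

Note that both metrics can exceed 100% when the hypothesis contains many insertions, which is why some baseline rows in the table show values above 100.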
To handle severe class imbalance across languages and datasets, we introduce a two-stage deterministic upsampling strategy.
Within each language, smaller datasets are replicated and subsampled to match the largest dataset in that language group, ensuring equal representation of all corpora within a single language.
Across languages, per-language corpora are balanced so every language contributes exactly 25% of the final training set, guaranteeing equal coverage regardless of raw data availability.
No language tags are used at training or inference time. The model identifies the spoken language implicitly from audio alone, which suits Singapore's heavily code-switched speech.
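The two-stage balancing can be sketched as follows. This is an illustrative reconstruction from the description above, not the actual training pipeline; the replicate-then-truncate mechanics and the data layout are assumptions.

```python
import math

def upsample(items, target):
    """Deterministically replicate a list, then truncate to exactly `target` items."""
    reps = math.ceil(target / len(items))
    return (items * reps)[:target]

def balance(corpora_by_lang):
    """corpora_by_lang: {lang: {corpus_name: [utterances]}} -> balanced per-language lists."""
    per_lang = {}
    # Stage 1: within each language, grow every corpus to match that
    # language's largest corpus, so all corpora are equally represented.
    for lang, corpora in corpora_by_lang.items():
        largest = max(len(c) for c in corpora.values())
        per_lang[lang] = [u for c in corpora.values() for u in upsample(c, largest)]
    # Stage 2: across languages, grow every language to the largest
    # language pool, so each contributes an equal share (25% with four languages).
    target = max(len(v) for v in per_lang.values())
    return {lang: upsample(v, target) for lang, v in per_lang.items()}
```

Because both stages are deterministic (replication plus truncation rather than random sampling), the resulting training mix is exactly reproducible across runs.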
| | MERaLiON-2-10B | Polyglot-Lion |
|---|---|---|
| Training Data | 120,000 h | 783 h |
| Hardware | 128 × H100 | 1 × RTX PRO 6000 |
| Training Time | 48 h | 48 h |
| Est. Cost | $18,862 | $81 |
| Model | Time (s/sample) |
|---|---|
| MERaLiON-2-10B-ASR | 2.0152 ± 0.8846 |
| Qwen2.5-Omni-3B | 1.7838 ± 1.0431 |
| Qwen2.5-Omni-7B | 1.3414 ± 0.6572 |
| SeaLLMs-Audio-7B | 0.6422 ± 0.0000 |
| Whisper-large-v3-turbo | 0.2822 ± 0.0230 |
| Qwen3-ASR-1.7B | 0.0809 ± 0.0290 |
| Qwen3-ASR-0.6B | 0.0686 ± 0.0251 |
| ★ Polyglot-Lion-0.6B | 0.0999 ± 0.0561 |
| ★ Polyglot-Lion-1.7B | 0.1038 ± 0.0621 |
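A per-sample latency harness along these lines produces the mean ± std figures above; `transcribe` stands in for any model's inference call, and the warm-up count is an assumption (warm-up runs are common practice to exclude one-off model-loading and compilation costs).

```python
import statistics
import time

def benchmark(transcribe, samples, warmup=3):
    """Return (mean, std) of per-sample wall-clock latency in seconds."""
    for s in samples[:warmup]:
        transcribe(s)  # warm-up runs, excluded from timing
    times = []
    for s in samples:
        t0 = time.perf_counter()
        transcribe(s)
        times.append(time.perf_counter() - t0)
    return statistics.mean(times), statistics.stdev(times)
```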
@misc{dang2026polyglotlion,
title={Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR},
author={Quy-Anh Dang and Chris Ngo},
year={2026},
eprint={2603.16184},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.16184},
}
Polyglot-Lion is an open research effort, and we warmly welcome contributions from the community. Whether you can offer multilingual speech datasets, GPU compute, code contributions, or evaluation expertise, we would love to hear from you. Together we can push the boundaries of accessible, efficient multilingual ASR.