Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR

Quy-Anh Dang*, Chris Ngo

*Corresponding author · Knovel Engineering Lab, Singapore

$81 training cost · 233× cheaper than MERaLiON · 20× faster inference · 14.85 avg error rate

Abstract

We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. Our models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora, using a balanced sampling strategy that equalises the number of training utterances per language and deliberately omits language-tag conditioning so that the model learns to identify languages implicitly from audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32) — a model 6× larger — while incurring a training cost of $81 on a single RTX PRO 6000 GPU compared to $18,862 for the 128-GPU baseline. Inference throughput is approximately 20× faster at 0.10 s/sample versus 2.02 s/sample.

State-of-the-Art Accuracy at a Fraction of the Size

Polyglot-Lion delivers competitive or best-in-class WER and CER across all four languages, matching or beating models up to 6× its size on most benchmarks.

Average error rate comparison across models. Benchmarks per language: English (WER) = LS, NSC, CV; Mandarin (CER) = AISH1, AISH3, Fleurs; Tamil (WER) = CV, SLR65, SLR127, Fleurs; Malay (WER) = Meso., Fleurs.

| Model | Params | LS | NSC | CV | AISH1 | AISH3 | Fleurs | CV | SLR65 | SLR127 | Fleurs | Meso. | Fleurs | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Whisper-large-v3-turbo | 0.8B | 3.04 | 32.02 | 17.91 | 9.64 | 16.81 | 10.63 | 74.50 | 58.13 | 69.56 | 66.90 | 28.47 | 8.88 | 33.04 |
| SeaLLMs-Audio-7B | 7B | 94.74 | 9.53 | 8.68 | 9.65 | 9.76 | 37.09 | 126.70 | 127.24 | 138.65 | 105.31 | 71.34 | 26.25 | 63.75 |
| Qwen2.5-Omni-3B | 3B | 29.21 | 34.79 | 46.36 | 28.25 | 44.55 | 54.74 | 318.36 | 465.58 | 448.82 | 311.67 | 211.90 | 74.69 | 172.37 |
| Qwen2.5-Omni-7B | 7B | 13.80 | 22.96 | 14.49 | 7.33 | 22.58 | 16.68 | 252.06 | 239.15 | 303.96 | 326.43 | 158.06 | 43.92 | 118.45 |
| Qwen3-ASR-0.6B | 0.6B | 2.74 | 7.64 | 10.06 | 2.08 | 2.59 | 9.75 | 121.10 | 127.00 | 129.12 | 130.09 | 47.29 | 18.71 | 50.68 |
| Qwen3-ASR-1.7B | 1.7B | 2.31 | 6.22 | 7.50 | 1.52 | 2.08 | 9.33 | 139.96 | 134.63 | 144.49 | 147.23 | 39.00 | 10.87 | 53.76 |
| MERaLiON-2-10B-ASR | 10B | 2.54 | 4.62 | 8.83 | 3.09 | 4.07 | 11.99 | 31.78 | 19.29 | 22.42 | 28.68 | 25.90 | 8.55 | 14.32 |
| ★ Polyglot-Lion-0.6B | 0.6B | 2.67 | 6.09 | 6.16 | 1.93 | 2.32 | 9.19 | 42.16 | 23.07 | 28.14 | 37.68 | 24.33 | 14.45 | 16.52 |
| ★ Polyglot-Lion-1.7B | 1.7B | 2.10 | 5.28 | 4.91 | 1.45 | 1.86 | 8.00 | 39.19 | 19.75 | 26.83 | 37.28 | 21.51 | 9.98 | 14.85 |

WER (%) for English, Tamil, and Malay; CER (%) for Mandarin. Lower is better. ★ = our models.
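The Avg column is consistent with an unweighted mean over the 12 benchmark scores. A quick check for Polyglot-Lion-1.7B, with the values transcribed from the table above:

```python
# Per-benchmark error rates for Polyglot-Lion-1.7B, in table order:
# English (LS, NSC, CV), Mandarin (AISH1, AISH3, Fleurs),
# Tamil (CV, SLR65, SLR127, Fleurs), Malay (Meso., Fleurs).
scores = [2.10, 5.28, 4.91, 1.45, 1.86, 8.00,
          39.19, 19.75, 26.83, 37.28, 21.51, 9.98]

avg = sum(scores) / len(scores)
print(avg)  # ≈ 14.85, matching the reported Avg
```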

Two-Stage Balanced Multilingual Upsampling

To handle severe class imbalance across languages and datasets, we introduce a two-stage deterministic upsampling strategy.

🔄 Stage 1 — Intra-Language Balancing

Within each language, every smaller dataset is deterministically upsampled, by replication plus subsampling of the final partial copy, to match the size of the largest dataset in that language group, so all corpora within a language are equally represented.

🌐 Stage 2 — Inter-Language Balancing

Across languages, per-language corpora are balanced so every language contributes exactly 25% to the final training set — guaranteeing equal coverage regardless of data availability.

🔇 Language-Agnostic Decoding

No language tags are used at training or inference time. The model identifies the spoken language implicitly from audio alone — ideal for Singapore's code-switching environment.

📋 Algorithm

[Figure: two-stage balanced upsampling algorithm]
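The two stages can be sketched as follows. This is a minimal illustration, not the paper's implementation: datasets are represented as lists of utterance IDs, the function names are hypothetical, and upsampling is done deterministically by cycling through each corpus until the target size is reached.

```python
import itertools

def balance_intra(datasets: dict) -> dict:
    """Stage 1: within one language, upsample every dataset to the
    size of the largest one by cycling through its utterances."""
    target = max(len(utts) for utts in datasets.values())
    return {
        name: list(itertools.islice(itertools.cycle(utts), target))
        for name, utts in datasets.items()
    }

def balance_inter(languages: dict) -> dict:
    """Stage 2: pool each language's intra-balanced corpora, then
    upsample every language to the size of the largest one, so each
    language contributes an equal share (25% for four languages)."""
    pooled = {
        lang: [u for utts in balance_intra(ds).values() for u in utts]
        for lang, ds in languages.items()
    }
    target = max(len(utts) for utts in pooled.values())
    return {
        lang: list(itertools.islice(itertools.cycle(utts), target))
        for lang, utts in pooled.items()
    }

# Toy example with heavily imbalanced corpora: English has two
# datasets (50 and 10 utterances), Tamil a single one (5 utterances).
corpora = {
    "en": {"LS": ["en"] * 50, "NSC": ["en"] * 10},
    "ta": {"CV": ["ta"] * 5},
}
balanced = balance_inter(corpora)
sizes = {lang: len(utts) for lang, utts in balanced.items()}
print(sizes)  # every language ends up the same size
```

After stage 1, both English datasets contribute 50 utterances each (pool of 100); after stage 2, Tamil is upsampled from 5 to 100, giving each language an equal share of the final mix.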

Training Data — 4 Languages, 12 Datasets, ~969 Hours

[Figure: dataset statistics by language]

Dramatically Lower Cost, Dramatically Higher Speed

💰 Training Cost Comparison

|  | MERaLiON-2-10B | Polyglot-Lion |
|---|---|---|
| Training data | 120,000 h | 783 h |
| Hardware | 128 × H100 | 1 × RTX PRO 6000 |
| Training time | 48 h | 48 h |
| Est. cost | $18,862 | $81 |
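The headline cost ratio follows directly from the table:

```python
meralion_cost = 18_862  # USD, 128 × H100 for 48 h
ours_cost = 81          # USD, 1 × RTX PRO 6000 for 48 h

ratio = meralion_cost / ours_cost
print(round(ratio))  # 233, i.e. ~233× cheaper
```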

⚡ Inference Speed

| Model | Time (s/sample) |
|---|---|
| MERaLiON-2-10B-ASR | 2.0152 ± 0.8846 |
| Qwen2.5-Omni-3B | 1.7838 ± 1.0431 |
| Qwen2.5-Omni-7B | 1.3414 ± 0.6572 |
| SeaLLMs-Audio-7B | 0.6422 ± 0.0000 |
| Whisper-large-v3-turbo | 0.2822 ± 0.0230 |
| Qwen3-ASR-1.7B | 0.0809 ± 0.0290 |
| Qwen3-ASR-0.6B | 0.0686 ± 0.0251 |
| ★ Polyglot-Lion-0.6B | 0.0999 ± 0.0561 |
| ★ Polyglot-Lion-1.7B | 0.1038 ± 0.0621 |
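The roughly 20× speedup quoted above follows from comparing per-sample latency against MERaLiON-2-10B-ASR:

```python
meralion_s = 2.0152  # s/sample, MERaLiON-2-10B-ASR
lion_17b_s = 0.1038  # s/sample, Polyglot-Lion-1.7B
lion_06b_s = 0.0999  # s/sample, Polyglot-Lion-0.6B

print(f"{meralion_s / lion_17b_s:.1f}x")  # 19.4x
print(f"{meralion_s / lion_06b_s:.1f}x")  # 20.2x
```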

Download on 🤗 Hugging Face

Polyglot-Lion-0.6B: 0.6 billion parameters · 16.52 avg error rate · 0.10 s/sample · 🤗 Download Model

Polyglot-Lion-1.7B: 1.7 billion parameters · 14.85 avg error rate · 0.10 s/sample · 🤗 Download Model

Cite Our Work

@misc{dang2026polyglotlion,
    title={Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR}, 
    author={Quy-Anh Dang and Chris Ngo},
    year={2026},
    eprint={2603.16184},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2603.16184}, 
}

Help Us Build Better Multilingual ASR

Polyglot-Lion is an open research effort, and we warmly welcome contributions from the community. Whether you can provide multilingual speech datasets, GPU compute resources, code contributions, or evaluation expertise — we would love to hear from you. Together we can push the boundaries of accessible, efficient multilingual ASR.

✉️ Reach Out to Collaborate