Alibaba’s Qwen3-ASR-Flash Outperforms Competition in Speech Transcription, Shines in Music Lyrics Recognition
Alibaba’s Qwen team has unveiled a new AI speech transcription model, Qwen3-ASR-Flash, aimed at accurate transcription in challenging environments.
Built on Qwen3-Omni and trained on more than 40 million hours of speech data, the model is designed to stay accurate in complex acoustic conditions and on intricate language patterns.
In a series of tests conducted in August 2025, the model posted lower error rates than its competitors. For standard Chinese, Qwen3-ASR-Flash achieved an error rate of 3.97%, less than half of Gemini-2.5-Pro's 8.98% and roughly a quarter of GPT4o-Transcribe's 15.72%.
On accented Chinese speech, the model recorded an error rate of 3.48%. In English, it kept its lead with an error rate of 3.81%, compared with 7.63% for Gemini and 8.45% for GPT4o.
One area where Qwen3-ASR-Flash stands out is transcribing music. The model recorded an error rate of 4.51% when recognizing lyrics from songs, ahead of its competitors. Internal tests on full songs showed the same pattern: it scored a 9.96% error rate, compared with 32.79% for Gemini-2.5-Pro and 58.59% for GPT4o-Transcribe.
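The article does not say how these error rates are calculated. A common convention in speech recognition is to divide the edit distance between the reference transcript and the model's output by the length of the reference, counting words for English (word error rate) and characters for Chinese (character error rate). The sketch below assumes that convention; the function names and sample sentences are illustrative and are not taken from Alibaba's evaluation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference token missing from output (deletion)
                curr[j - 1] + 1,     # extra token in output (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Edit distance divided by reference length.

    unit="word" splits on whitespace (typical for English);
    unit="char" compares characters, ignoring spaces (typical for Chinese).
    """
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substitution ("lights" -> "light") and one deletion ("now")
    # against a five-word reference.
    rate = error_rate("turn the lights off now", "turn the light off")
    print(f"{rate:.2%}")
```

Under this definition, a 3.97% error rate corresponds to roughly four edits for every 100 reference tokens. Real evaluations usually normalize punctuation and casing before scoring, which this sketch omits.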
Beyond its impressive accuracy, Qwen3-ASR-Flash introduces innovative features for next-generation AI transcription tools. One such innovation is its flexible contextual biasing, eliminating the need for complex keyword list formatting. Users can now feed the model background text in various formats to receive customized results, whether it’s a simple list of keywords, entire documents, or a mix of both.
As a result, contextual information needs little preprocessing. The model uses the supplied context to improve accuracy, and its output is not heavily affected when that context turns out to be irrelevant.
Alibaba aims to position Qwen3-ASR-Flash as a global speech transcription tool, offering accurate transcription from a single model supporting 11 languages, including numerous dialects and accents. The model’s support for Chinese is particularly comprehensive, covering Mandarin, Cantonese, Sichuanese, Minnan (Hokkien), and Wu dialects.
For English speakers, the model handles various regional accents, including British and American. The remaining nine supported languages are French, German, Spanish, Italian, Portuguese, Russian, Japanese, Korean, and Arabic.
The model is capable of identifying which of the 11 languages is being spoken and effectively rejects non-speech segments like silence or background noise, ensuring cleaner output compared to previous AI speech transcription tools.