Some real benchmarks for speech LLMs. ASR + text LLM still wins
https://github.com/MatthewCYM/VoiceBench
https://arxiv.org/abs/2502.06490
Recent Advances in Discrete Speech Tokens: A Review
Yiwei Guo, Zhihan Li, Hankun Wang, Bohan Li, Chongtian Shao, Hanglei Zhang, Chenpeng Du, Xie Chen, Shujie Liu, Kai Yu
The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete, compact, and concise nature, are not only advantageous for efficient transmission and storage, but also inherently compatible with the language modeling framework, enabling seamless integration of speech into text-dominated LLM architectures. Current research categorizes discrete speech tokens into two principal classes: acoustic tokens and semantic tokens, each of which has evolved into a rich research domain characterized by unique design philosophies and methodological approaches. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, conducts a critical examination of the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types. Furthermore, we identify persistent challenges in the field and propose potential research directions, aiming to offer actionable insights to inspire future advancements in the development and application of discrete speech tokens.
I've spent some time on generative error correction recently; more numbers and results on it later. Meanwhile, the paper:
https://arxiv.org/abs/2409.09785
Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition
Chao-Han Huck Yang, Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr Żelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Sabato Marco Siniscalchi, Eng Siong Chng, Peter Bell, Catherine Lai, Shinji Watanabe, Andreas Stolcke
Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. These tasks aim to emulate future LLM-based agents handling voice-based interfaces while remaining accessible to a broad audience by utilizing open pretrained language models or agent-based APIs. We also discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.
Another, older paper is here:
https://www.tg-me.com/speechtech/1962
https://github.com/soham97/mellow
https://arxiv.org/abs/2503.08540
Mellow: a small audio language model for reasoning
Soham Deshmukh, Satvik Dixit, Rita Singh, Bhiksha Raj
Multimodal Audio-Language Models (ALMs) can understand and reason over both audio and text. Typically, reasoning performance correlates with model size, with the best results achieved by models exceeding 8 billion parameters. However, no prior work has explored enabling small audio-language models to perform reasoning tasks, despite the potential applications for edge devices. To address this gap, we introduce Mellow, a small Audio-Language Model specifically designed for reasoning. Mellow achieves state-of-the-art performance among existing small audio-language models and surpasses several larger models in reasoning capabilities. For instance, Mellow scores 52.11 on MMAU, comparable to SoTA Qwen2 Audio (which scores 52.5) while using 50 times fewer parameters and being trained on 60 times less data (audio hrs). To train Mellow, we introduce ReasonAQA, a dataset designed to enhance audio-grounded reasoning in models. It consists of a mixture of existing datasets (30% of the data) and synthetically generated data (70%). The synthetic dataset is derived from audio captioning datasets, where Large Language Models (LLMs) generate detailed and multiple-choice questions focusing on audio events, objects, acoustic scenes, signal properties, semantics, and listener emotions. To evaluate Mellow's reasoning ability, we benchmark it on a diverse set of tasks, assessing on both in-distribution and out-of-distribution data, including audio understanding, deductive reasoning, and comparative reasoning. Finally, we conduct extensive ablation studies to explore the impact of projection layer choices, synthetic data generation methods, and language model pretraining on reasoning performance. Our training dataset, findings, and baseline pave the way for developing small ALMs capable of reasoning.
ICASSP 2025 papers are available online now
https://ieeexplore.ieee.org/xpl/conhome/10887540/proceeding?isnumber=10887541
The program website is weird:
https://icassp25.conflux.events/program
Interesting distillation of Kokoro
https://github.com/EndlessReform/smoltts
They also provide a speech dataset encoded with Mimi; LibriTTS-R is just 60 MB.
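For reference, a minimal sketch of encoding and decoding audio with Mimi via the Hugging Face transformers port (kyutai/mimi) is below; smoltts may prepare its dataset with different tooling, so treat this as an assumption about how the discrete tokens are obtained.

```python
# Hedged sketch: turn speech into discrete Mimi tokens (and back) with the
# transformers port of the codec. smoltts may use different tooling; this
# just illustrates why a tokenized corpus ends up so small.
import torch
from datasets import Audio, load_dataset
from transformers import AutoFeatureExtractor, MimiModel

model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# Any mono waveform works; here we resample a dummy LibriSpeech sample.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
audio = ds[0]["audio"]["array"]

inputs = feature_extractor(raw_audio=audio,
                           sampling_rate=feature_extractor.sampling_rate,
                           return_tensors="pt")
with torch.no_grad():
    codes = model.encode(inputs["input_values"]).audio_codes  # discrete tokens
    recon = model.decode(codes).audio_values                  # waveform back

print(codes.shape)  # a few small integer streams per frame instead of raw audio
```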
Some more results from our experiments with GEC using LLMs:
https://alphacephei.com/nsh/2025/03/15/generative-error-correction.html
Most 8B models at 4-bit quantization are not very stable; hallucinations show up in about 25% of cases. Qwen is particularly unstable for this task.
Gemma 2 and Gemma 3 are okay; we have yet to try the 27B version.
The simple prompts from the papers certainly don't work. One has to provide much more detail and list specific issues in the prompt. We still need to work on the prompt more.
Even prompt formatting matters: by modifying the prompt format we were able to reduce WER from 26% to 16% (a rough sketch of this kind of prompting and scoring follows this list).
For now GEC doesn't look like a breakthrough technology; some extra sauce seems to be needed. Simple ROVER is equally good and much more stable.
We discussed on the channel with iLa that an English prompt helps for non-English languages. I think it is possible for some models, but I couldn't confirm it in experiments.
For big models, splitting the input doesn't help much.
There is still a lot of overcorrection of proper names that are rare and unknown to the LLM, and overcorrection of grammar. We need to work more on this.
The difference between Gemma2-9B and Gemini Flash is not very large, except for the number of hallucinations.
Most models have very poor knowledge of rare domains and poor knowledge of speech (phonetics).
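The sketch mentioned above: feed the ASR hypothesis to an instruction-tuned LLM with a restrictive prompt and score with jiwer. The prompt, model choice, and example sentences are illustrative assumptions, not the exact setup from the blog post.

```python
# Illustrative GEC loop: correct an ASR hypothesis with an instruction-tuned
# LLM and compare WER before/after with jiwer. The prompt here is a toy
# example; as noted above, a much more detailed prompt is needed in practice.
import jiwer
from transformers import pipeline

generator = pipeline("text-generation",
                     model="google/gemma-2-9b-it",  # any instruction-tuned model
                     device_map="auto")

reference  = "we will meet at the station at five thirty"
hypothesis = "we will meat at the station at five dirty"

prompt = (
    "You are a post-processor for a speech recognition system.\n"
    "Fix only clear recognition errors in the transcript below.\n"
    "Do not rephrase, do not fix grammar, and do not change proper names\n"
    "you are unsure about. Return only the corrected transcript.\n\n"
    f"Transcript: {hypothesis}\n"
)

out = generator(prompt, max_new_tokens=128, do_sample=False)
corrected = out[0]["generated_text"][len(prompt):].strip()

print("WER before:", jiwer.wer(reference, hypothesis))
print("WER after: ", jiwer.wer(reference, corrected))
```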
Twitter suggested this paper on GEC to me, stressing the named entity recognition issue, right on the subject:
https://arxiv.org/abs/2410.13198
https://x.com/PiotrZelasko/status/1902723841534681357
Canary-1B-Flash and Canary-180M-Flash - two new variants of Canary optimized for fast training and inference.
Key features of Canary-1B-Flash:
* Several times faster!
* More accurate than Canary-1B!
* Word-level timestamps!
* Dropped NC license!
Both models support the same set of languages as the original Canary-1B: English, French, Spanish, and German. A rough usage sketch is below.
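A hedged usage sketch, assuming the Flash variants keep the same NeMo interface as the original Canary-1B; check the nvidia/canary-1b-flash model card for the exact arguments.

```python
# Hedged sketch: plain English transcription with Canary-1B-Flash via NeMo,
# assuming the EncDecMultiTaskModel interface of the original Canary-1B.
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-flash")
model.eval()

# The model card documents further options (source/target language, PnC,
# timestamps) for translation and word-level timestamps.
hypotheses = model.transcribe(["sample_en.wav"], batch_size=1)
print(hypotheses[0])
```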
https://github.com/canopyai/Orpheus-TTS/issues/10#issuecomment-2740645470
christophschuhmann left a comment (canopyai/Orpheus-TTS#10)
Hey, Christoph from Laion here, the guy who made the Laion 5 billion data set. I have been making a voice acting data set with some donations from Intel with altogether 5,000 hours of high quality voice acting.
https://huggingface.co/datasets/laion/laions_got_talent_enhanced_flash_annotations_and_long_captions
https://huggingface.co/datasets/laion/laions_got_talent_raw
I was using HyperLab, which is a reseller for the OpenAI API, so I never actually agreed to the OpenAI terms of service, and then prompted the voice API to role-play like an actor at a casting audition. This way I generated evenly distributed utterances over 40 emotion categories for all 11 OpenAI voices for English, German, French, and Spanish. The data is already online and I also have very detailed emotion captions.

I will make an official release in the next few weeks, but you could already take it and tune German, Spanish, and French models on it. I would be very happy about a capable German model because I want to deploy voice assistants in schools in Germany. I'm doing all of this in my free time; I am still a high school teacher and want to keep it this way.

In the first repository, the quality is the best, but unfortunately I lost the accent labels for the English samples, and some samples in the English part have accents. In the second repository, you find the unenhanced data, which has slightly lower recording quality, but the emotion entry of the JSON contains the corresponding accent. For English, I generated 14 different accents; German, Spanish, and French don't have any accents. Have fun!
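To poke at the data, a hedged sketch with Hugging Face datasets is below; the split name and field layout are assumptions, so check the dataset viewer on the Hub for the real schema (including the emotion/accent entries mentioned above). If the repo turns out to be raw files rather than a datasets-compatible layout, huggingface_hub.snapshot_download is the fallback.

```python
# Hedged sketch: stream a few samples from the LAION voice-acting dataset.
# Split name and keys are assumptions; inspect the Hub viewer before
# relying on specific fields.
from datasets import load_dataset

ds = load_dataset("laion/laions_got_talent_raw", split="train", streaming=True)
for sample in ds.take(3):
    print(sample.keys())  # expect audio plus JSON metadata with emotion/accent info
```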
Not a day without a new TTS.
https://x.com/anuj_diwan/status/1902884487718965330
If you'd like an open-source text-to-speech model that follows your style instructions, consider using our ParaSpeechCaps-based model!
Model: https://huggingface.co/ajd12342/parler-tts-mini-v1-paraspeechcaps
Paper: https://arxiv.org/abs/2503.04713
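A hedged usage sketch following the usual Parler-TTS pattern; the ParaSpeechCaps checkpoint may require the authors' fork of parler_tts and a specific style-description format, so treat the model card as authoritative. The description string here is an invented example.

```python
# Hedged sketch: style-prompted synthesis in the standard Parler-TTS style.
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

repo = "ajd12342/parler-tts-mini-v1-paraspeechcaps"
model = ParlerTTSForConditionalGeneration.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

description = "A female speaker with a husky, animated voice, speaking slowly in a quiet room."
text = "Not a day goes by without a new TTS model."

desc_ids = tokenizer(description, return_tensors="pt").input_ids
prompt_ids = tokenizer(text, return_tensors="pt").input_ids

audio = model.generate(input_ids=desc_ids, prompt_input_ids=prompt_ids)
sf.write("out.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)
```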
It all comes from NLP.
https://github.com/Bartelds/ctc-dro
https://arxiv.org/abs/2502.01777
CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition
Martijn Bartelds, Ananjan Nandi, Moussa Koulako Bala Doumbouya, Dan Jurafsky, Tatsunori Hashimoto, Karen Livescu
Modern deep learning models often achieve high overall performance, but consistently fail on specific subgroups. Group distributionally robust optimization (group DRO) addresses this problem by minimizing the worst-group loss, but it fails when group losses misrepresent performance differences between groups. This is common in domains like speech, where the widely used connectionist temporal classification (CTC) loss scales with input length and varies with linguistic and acoustic properties, leading to spurious differences between group losses. We present CTC-DRO, which addresses the shortcomings of the group DRO objective by smoothing the group weight update to prevent overemphasis on consistently high-loss groups, while using input length-matched batching to mitigate CTC's scaling issues. We evaluate CTC-DRO on the task of multilingual automatic speech recognition (ASR) across five language sets from the ML-SUPERB 2.0 benchmark. CTC-DRO consistently outperforms group DRO and CTC-based baseline models, reducing the worst-language error by up to 47.1% and the average error by up to 32.9%. CTC-DRO can be applied to ASR with minimal computational costs, and offers the potential for reducing group disparities in other domains with similar challenges.
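For background, the exponentiated-gradient group-DRO update that CTC-DRO modifies looks roughly like the sketch below; the smoothed update and length-matched batching are in the paper and repo, so this is not the authors' implementation.

```python
# Background sketch: vanilla group-DRO weight update that CTC-DRO builds on.
# CTC-DRO smooths this update and uses length-matched batches so that CTC
# loss scale differences between languages don't dominate (see paper/repo).
import torch

def group_dro_update(q: torch.Tensor, group_losses: torch.Tensor,
                     eta: float = 0.1) -> torch.Tensor:
    """Multiplicative update of per-group weights from per-group losses."""
    q = q * torch.exp(eta * group_losses)  # upweight currently high-loss groups
    return q / q.sum()                     # renormalize to a distribution

def dro_loss(q: torch.Tensor, group_losses: torch.Tensor) -> torch.Tensor:
    """Weighted training objective for one batch covering several languages."""
    return (q * group_losses).sum()
```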
https://github.com/DataoceanAI/Dolphin
Dolphin is a multilingual, multitask ASR model developed through a collaboration between Dataocean AI and Tsinghua University. It supports 40 Eastern languages across East Asia, South Asia, Southeast Asia, and the Middle East, while also supporting 22 Chinese dialects. It is trained on over 210,000 hours of data, which includes both DataoceanAI's proprietary datasets and open-source datasets. The model can perform speech recognition, voice activity detection (VAD), segmentation, and language identification (LID).
Supports Russian, Uzbek, Kazakh, Tajik, etc.:
https://github.com/DataoceanAI/Dolphin/blob/main/languages.md
https://github.com/gwh22/UniVoice
This work introduces UniVoice, a novel approach that integrates autoregression and flow matching within a transformer-based framework for speech unified understanding and generation. UniVoice is designed to achieve both speech comprehension and generation capabilities through a unified model trained in a single stage. Our experiments demonstrate that UniVoice delivers strong performance for both automatic speech recognition and zero-shot speech synthesis tasks. By combining autoregression and flow matching, UniVoice establishes a foundation for expanding to additional audio understanding and generation tasks using the paradigm in the future.
Announcing the AudioMOS Challenge 2025!
Homepage: https://sites.google.com/view/voicemos-challenge/audiomos-challenge-2025
We are enlarging the scope of the previous VoiceMOS challenge series to cover not only speech but also music and general audio.
Founded in 2022, the VoiceMOS Challenge (VMC) series aims to compare prediction techniques for human ratings of speech. To facilitate development in the automatic evaluation of audio generation systems, we decided to enlarge the scope and rename it as the AudioMOS Challenge.
Track 1: MOS prediction for text-to-music systems
This track is based on the MusicEval dataset, spanning 31 TTM systems, along with ratings collected from music experts. Evaluation was conducted across two axes: overall musical impression and alignment with the text prompt.
Track 2: Audiobox-aesthetics-style prediction for TTS, TTA and TTM samples
This track is based on the recently released Meta Audiobox Aesthetics, where they proposed four new axes: production quality, production complexity, content enjoyment, and content usefulness.
Track 3: MOS prediction for speech in high sampling frequencies
For the training set, we provide samples at 16/24/48 kHz, and during evaluation the participants are asked to evaluate samples so that their scores reflect a listening test that contains samples at all frequencies.
We are planning to submit a challenge proposal to ASRU2025. The challenge will start officially on April 9th. Please pre-register if interested!
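As a side note (not part of the announcement): MOS-prediction tracks are usually scored with MSE and correlation metrics at the utterance and system level; a minimal sketch with numpy/scipy, using toy numbers:

```python
# Generic MOS-prediction scoring sketch (MSE, LCC, SRCC, KTAU); toy values.
import numpy as np
from scipy import stats

true_mos = np.array([3.2, 4.1, 2.8, 3.9])  # listening-test ratings
pred_mos = np.array([3.0, 4.3, 2.9, 3.7])  # model predictions

mse = float(np.mean((true_mos - pred_mos) ** 2))
lcc = float(np.corrcoef(true_mos, pred_mos)[0, 1])  # linear correlation
srcc, _ = stats.spearmanr(true_mos, pred_mos)       # rank correlation
ktau, _ = stats.kendalltau(true_mos, pred_mos)
print(f"MSE={mse:.3f} LCC={lcc:.3f} SRCC={srcc:.3f} KTAU={ktau:.3f}")
```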