https://github.com/gwh22/UniVoice
This work introduces UniVoice, a novel approach that integrates autoregression and flow matching within a transformer-based framework for unified speech understanding and generation. UniVoice is designed to achieve both speech comprehension and generation capabilities through a unified model trained in a single stage. Our experiments demonstrate that UniVoice delivers strong performance on both automatic speech recognition and zero-shot speech synthesis. By combining autoregression and flow matching, UniVoice establishes a foundation for expanding this paradigm to additional audio understanding and generation tasks in the future.
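A minimal sketch of the single-stage, dual-objective idea described above, assuming a shared transformer trunk with a next-token cross-entropy loss for recognition and a conditional flow-matching (velocity regression) loss for synthesis. All module names, dimensions, and the conditioning scheme are my illustration, not the repo's actual implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedSpeechModel(nn.Module):
    # Hypothetical sketch: one transformer trunk shared by an AR text head and a flow-matching head.
    def __init__(self, vocab_size=1000, d_model=256, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.mel_proj = nn.Linear(n_mels, d_model)
        self.time_proj = nn.Linear(1, d_model)                 # embeds the flow-matching time step t
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        self.text_head = nn.Linear(d_model, vocab_size)        # autoregressive (next-token) head
        self.vel_head = nn.Linear(d_model, n_mels)             # flow-matching velocity head

    def asr_loss(self, mel, text):
        # Understanding: predict each text token from speech plus previous tokens
        # (a causal/prefix attention mask is omitted for brevity).
        h = self.trunk(torch.cat([self.mel_proj(mel), self.text_emb(text[:, :-1])], dim=1))
        logits = self.text_head(h[:, mel.size(1):])
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), text[:, 1:].reshape(-1))

    def flow_matching_loss(self, mel, text):
        # Generation: regress the constant velocity (x1 - x0) along a straight noise->mel path.
        x1, x0 = mel, torch.randn_like(mel)
        t = torch.rand(mel.size(0), 1, 1, device=mel.device)
        xt = (1 - t) * x0 + t * x1
        cond = self.text_emb(text) + self.time_proj(t.expand(-1, text.size(1), -1))
        h = self.trunk(torch.cat([cond, self.mel_proj(xt)], dim=1))
        return F.mse_loss(self.vel_head(h[:, text.size(1):]), x1 - x0)

model = UnifiedSpeechModel()
mel = torch.randn(2, 120, 80)                   # (batch, frames, mel bins)
text = torch.randint(0, 1000, (2, 16))          # (batch, tokens)
loss = model.asr_loss(mel, text) + model.flow_matching_loss(mel, text)   # single-stage joint objective
loss.backward()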
Announcing the AudioMOS Challenge 2025!
Homepage: https://sites.google.com/view/voicemos-challenge/audiomos-challenge-2025
We are enlarging the scope of the previous VoiceMOS challenge series to cover not only speech but also music and general audio.
Founded in 2022, the VoiceMOS Challenge (VMC) series aims to compare prediction techniques for human ratings of speech. To facilitate development in the automatic evaluation of audio generation systems, we decided to enlarge the scope and rename it as the AudioMOS Challenge.
Track 1: MOS prediction for text-to-music systems
This track is based on the MusicEval dataset, spanning 31 TTM systems, along with ratings collected from music experts. Evaluation was conducted across two axes: overall musical impression and alignment with the text prompt.
Track 2: Audiobox-aesthetics-style prediction for TTS, TTA and TTM samples
This track is based on the recently released Meta Audiobox Aesthetics, where they proposed four new axes: production quality, production complexity, content enjoyment, and content usefulness.
Track 3: MOS prediction for speech in high sampling frequencies
For the training set, we provide samples at 16/24/48 kHz, and during evaluation, participants are asked to score samples so that their predictions reflect ratings from a listening test containing samples at all sampling frequencies.
We are planning to submit a challenge proposal to ASRU2025. The challenge will start officially on April 9th. Please pre-register if interested!
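For context, MOS-prediction tracks in earlier VoiceMOS editions were scored with utterance-level and system-level agreement between predicted and human scores (MSE plus Pearson/Spearman/Kendall correlations); I'm assuming the AudioMOS tracks follow the same pattern, since the exact metrics aren't stated above. A small scoring sketch (the helper and the toy numbers are mine):

import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

def mos_metrics(pred, true, system_ids):
    # Utterance-level and system-level agreement between predicted and human MOS.
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    utt = {"MSE": float(np.mean((pred - true) ** 2)),
           "LCC": pearsonr(pred, true)[0],
           "SRCC": spearmanr(pred, true)[0],
           "KTAU": kendalltau(pred, true)[0]}
    # System level: average each system's utterance scores first, then correlate the means.
    systems = sorted(set(system_ids))
    p_sys = np.array([pred[[i for i, s in enumerate(system_ids) if s == sid]].mean() for sid in systems])
    t_sys = np.array([true[[i for i, s in enumerate(system_ids) if s == sid]].mean() for sid in systems])
    sys_ = {"LCC": pearsonr(p_sys, t_sys)[0], "SRCC": spearmanr(p_sys, t_sys)[0]}
    return utt, sys_

utt, sys_ = mos_metrics(pred=[3.1, 4.0, 2.5, 3.8, 4.4, 2.2],
                        true=[3.5, 4.2, 2.0, 3.9, 4.6, 2.4],
                        system_ids=["A", "A", "B", "B", "C", "C"])
print(utt, sys_)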
News
This talk was shared here already, but I watched it again recently and can recommend revisiting it.
"Hearing the AGI: from GMM-HMM to GPT-4o" - Yu Zhang
November 15th LTI Colloquium Speaker
https://www.youtube.com/watch?v=pRUrO0x637A
Highly recommended:
1. Importance of scale
2. Importance of self-supervised learning for dirty data training
3. Very tricky case with dither seed and self-supervised learning
4. Voice search data is useless
5. Importance of multi-objective training (again)
6. Why readable transcripts (Whisper) are better than good WER (RNN-T)
7. Discussion on factors of audio and text data for audio LLM training
8. Size of the decoder and size of the encoder
Not always relevant for us GPU-poor folks, but very nice overall.
https://github.com/zhai-lw/SQCodec
https://arxiv.org/abs/2504.04949
One Quantizer is Enough: Toward a Lightweight Audio Codec
Linwei Zhai, Han Ding, Cui Zhao, fei wang, Ge Wang, Wang Zhi, Wei Xi
Uniform steps are definitely a problem in speech LLMs. A couple of attempts to solve this have appeared recently; the idea is to apply text/speech alignment before feeding the data into the LLM:
https://github.com/FreedomIntelligence/Soundwave
https://github.com/mtkresearch/TASTE-SpokenLM
https://arxiv.org/abs/2502.12900
Soundwave: Less is More for Speech-Text Alignment in LLMs
Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li
Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at this https URL.
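The sequence-length inconsistency mentioned in the abstract is usually handled by shrinking the speech feature sequence toward text length before the LLM sees it. A rough sketch of one common recipe, CTC-guided frame selection (my illustration of the general idea, not necessarily Soundwave's exact mechanism):

import torch

def shrink_speech_features(frames, ctc_logits, blank_id=0):
    # Keep one encoder frame per emitted CTC label, dropping blanks and repeats, so the
    # resulting sequence is roughly text-length before it is fed to the LLM.
    # frames: (T, D) encoder outputs; ctc_logits: (T, V) frame-level CTC logits.
    labels = ctc_logits.argmax(dim=-1)            # greedy frame-level labels
    keep, prev = [], blank_id
    for t, lab in enumerate(labels.tolist()):
        if lab != blank_id and lab != prev:       # a new non-blank emission starts here
            keep.append(t)
        prev = lab
    return frames[keep] if keep else frames[:1]   # fall back to one frame if everything is blank

# Example: 200 encoder frames shrink to a handful of token-like vectors.
frames, ctc_logits = torch.randn(200, 512), torch.randn(200, 32)
print(frames.shape, "->", shrink_speech_features(frames, ctc_logits).shape)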
https://sites.google.com/view/respinasrchallenge2025/home
MADASR 2.0 : Multi-Lingual Multi-Dialect ASR Challenge in 8 Indian Languages
🔗 Join the MADASR 2.0 Google Group to connect with participants and stay updated!
Recent advances in automatic speech recognition (ASR) have been driven by self-supervised learning (SSL) models such as wav2vec2, and large-scale multilingual systems like…
The second baseline from https://x.com/xueyao_98 is now available!
Check out their technical blog and open-sourced code:
Blog: https://veiled-army-9c5.notion.site/Vevo1-5-1d2ce17b49a280b5b444d3fa2300c93a
Code: https://github.com/open-mmlab/Amphion/tree/main/models/svc/vevosing
Training data will be distributed starting April 28th.
Register for SVCC here: https://forms.gle/GZGAWJAZvgDK6QKcA
Epic review process of MegaTTS3
https://openreview.net/forum?id=o362EkNU2z
The model itself is well designed; we love non-autoregressive models and the MFA aligner too!
https://github.com/bytedance/MegaTTS3
Sparse Alignment Enhanced Latent Diffusion Transformer for...
While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment...
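For reference, the role an MFA-style aligner typically plays in a non-autoregressive TTS model: phoneme-level durations from forced alignment expand the phoneme sequence to frame rate before the decoder. A toy length-regulator sketch (my illustration of the standard trick, not MegaTTS3's actual code, which uses a sparse alignment scheme):

import torch

def length_regulate(phoneme_states, durations):
    # Repeat each phoneme hidden state by its aligned duration in frames.
    # phoneme_states: (N, D); durations: (N,) integer frame counts from a forced aligner such as MFA.
    return torch.repeat_interleave(phoneme_states, durations, dim=0)

phoneme_states = torch.randn(5, 256)                 # 5 phonemes
durations = torch.tensor([7, 3, 12, 5, 9])           # frames per phoneme from the aligner
print(length_regulate(phoneme_states, durations).shape)   # torch.Size([36, 256]) frame-level decoder input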
Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics
https://arxiv.org/abs/2503.01174
https://x.com/Sid_Arora_18/status/1897315720205328593
As usual, the baseline cascaded system is intentionally weak. Whisper tiny as a baseline???
Dataset generated with OpenAI
https://huggingface.co/datasets/laion/laions_got_talent
"LAION's Got Talent" is a generated dataset comprising voice acting samples that exhibit a wide range of emotions, vocal bursts, topics, and content. This dataset is a component of the BUD-E project, spearheaded by LAION with support from Intel.
NVIDIA released a new English model that improves leaderboard results
https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
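Quick usage, following the pattern from the model card (needs nemo_toolkit[asr]; the wav path is a placeholder and should be 16 kHz mono):

import nemo.collections.asr as nemo_asr

# Downloads nvidia/parakeet-tdt-0.6b-v2 from Hugging Face on first use.
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v2")
output = asr_model.transcribe(["sample.wav"])   # placeholder file name
print(output[0])   # depending on the NeMo version this is a string or a hypothesis object with a .text field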
517 pages on instrument/vocals separation
https://docs.google.com/document/d/17fjNvJzj8ZGSer7c7OFe_CNfUKbAxEh_OBv94ZdRG5c/
Instrumental and vocal & stems separation & mastering
(UVR 5 GUI: VR/MDX-Net/MDX23C/Demucs 1-4, and BS/Mel-Roformer in beta
MVSEP-MDX23-Colab/KaraFan/drumsep/LarsNet/SCNet
x-minus.pro (uvronline.app)/mvsep.com/
GSEP/Dango.ai/Audioshake/Music.ai)
Daniel Povey's talk, "CR-CTC: Consistency regularization on CTC for improved speech recognition"
https://youtube.com/watch?v=2B1-gKDTuh0
Overall, consistency-based training is gaining more and more importance these days in different areas - TTS too. As the amount of training data reaches its limit, better supervision brings more gains.
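The core recipe, as I understand it (a simplified paraphrase, not the paper's exact loss): run the same model on two differently augmented views of each utterance, apply CTC to both, and add a consistency term between the two frame-level posterior sequences:

import torch
import torch.nn.functional as F

def cr_ctc_loss(log_probs_a, log_probs_b, targets, in_lens, tgt_lens, alpha=0.2):
    # log_probs_*: (T, B, V) log-softmax outputs of the same model on view A / view B.
    ctc = F.ctc_loss(log_probs_a, targets, in_lens, tgt_lens, blank=0) + \
          F.ctc_loss(log_probs_b, targets, in_lens, tgt_lens, blank=0)
    # Symmetric KL between the two posterior sequences, with a stop-gradient on the "teacher" side.
    consistency = F.kl_div(log_probs_a, log_probs_b.detach(), log_target=True, reduction="batchmean") + \
                  F.kl_div(log_probs_b, log_probs_a.detach(), log_target=True, reduction="batchmean")
    return ctc + alpha * consistency

# Toy shapes: 50 frames, batch of 2, 30-symbol vocabulary (index 0 = blank).
log_probs_a = torch.randn(50, 2, 30).log_softmax(-1)
log_probs_b = torch.randn(50, 2, 30).log_softmax(-1)
targets = torch.randint(1, 30, (2, 10))
loss = cr_ctc_loss(log_probs_a, log_probs_b, targets,
                   in_lens=torch.full((2,), 50), tgt_lens=torch.full((2,), 10))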
A new multilingual speech restoration paper is out, Miipher-2 🚀! The RTF on a TPU is 0.0078: 1 million hours of data can be cleaned in 3 days using just 100 TPUs!
Paper: https://arxiv.org/abs/2505.04457
Demo: https://google.github.io/df-conformer/miipher2/
Miipher-2: A Universal Speech Restoration Model for Million-Hour...
Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed for million-hour scale data, for training data...
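The headline number checks out, assuming RTF here means seconds of compute per second of audio on a single TPU:

# Back-of-the-envelope check of the claim above.
rtf = 0.0078                                 # seconds of TPU compute per second of audio
hours_of_audio = 1_000_000
tpus = 100

tpu_hours = hours_of_audio * rtf             # 7,800 TPU-hours of compute
wall_clock_days = tpu_hours / tpus / 24      # spread across 100 TPUs
print(f"{tpu_hours:.0f} TPU-hours -> {wall_clock_days:.1f} days on {tpus} TPUs")   # ~3.3 days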
18B speech recognition models
https://huggingface.co/collections/espnet/owls-scaling-laws-for-speech-recognition-and-translation-67ab7f991c194065f057ce8d
https://arxiv.org/abs/2502.10373
OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models
William Chen, Jinchuan Tian, Yifan Peng, Brian Yan, Chao-Han Huck Yang, Shinji Watanabe
Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with the 18B version being the largest speech model, to the best of our knowledge. OWLS leverages up to 360K hours of public speech data across 150 languages, enabling a systematic investigation into how data, model, and compute scaling each influence performance in multilingual speech tasks. We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling. One of our key findings is that scaling enhances performance on low-resource languages/dialects, helping to mitigate bias and improve the accessibility of speech technologies. Finally, we show how OWLS can be used to power new research directions by discovering emergent abilities in la
🦉 A suite of Whisper-style models from 250M to 18B parameters. Trained on up to 360K hours of data. 16k sampling rate.
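For readers who want to play with the "performance can be reliably predicted when scaling" part: the usual tool is a saturating power-law fit of error rate against model size. The functional form and the data points below are illustrative (made up), not OWLS's actual fits:

import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n_params_b, a, b, c):
    # Error falls as a * N^(-b) toward an irreducible floor c (N in billions of parameters).
    return a * n_params_b ** (-b) + c

sizes = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 9.0])       # made-up model sizes (B params)
wer = np.array([18.0, 15.5, 13.8, 12.6, 11.8, 11.1])    # made-up WERs for those sizes

(a, b, c), _ = curve_fit(scaling_law, sizes, wer, p0=[10.0, 0.5, 10.0], maxfev=10000)
print(f"Extrapolated WER at 18B params: {scaling_law(18.0, a, b, c):.1f}")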
Gemini models are very good here (and the recent 2.5 preview is even better)
https://github.com/ddlBoJack/MMAR
MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
These ConvAI videos are quite good, better than papers; I can again recommend both recent ones:
"Automatic Quality Assessment for Speech and Beyond" - Wen-Chin Huang
It includes a very important classification of speech quality aspects.
https://www.youtube.com/watch?v=REH034Wm3so
"Automatic Quality Assessment for Speech and Beyond"
Very important classification of speech quality aspects inside
https://www.youtube.com/watch?v=REH034Wm3so
YouTube
"Automatic Quality Assessment for Speech and Beyond" - Wen-Chin Huang
Talk 20 of the Conversational AI Reading Group about "Automatic Quality Assessment for Speech and Beyond" by Wen-Chin Huang.
For further information about the Reading Group, please check out https://poonehmousavi.github.io/rg
For further information about the Reading Group, please check out https://poonehmousavi.github.io/rg
Voicebox is a fundamental model by itself, but Leda Sari's talk "The Voicebox Model and Its Applications" has a very interesting part about applying synthetic data to ASR model training.
https://www.youtube.com/watch?v=PKleJNikO8M