Some real benchmarks for speech LLMs. ASR + text LLM still wins
https://github.com/MatthewCYM/VoiceBench
https://arxiv.org/abs/2502.06490
Recent Advances in Discrete Speech Tokens: A Review
Yiwei Guo, Zhihan Li, Hankun Wang, Bohan Li, Chongtian Shao, Hanglei Zhang, Chenpeng Du, Xie Chen, Shujie Liu, Kai Yu
The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete, compact, and concise nature, are not only advantageous for efficient transmission and storage, but also inherently compatible with the language modeling framework, enabling seamless integration of speech into text-dominated LLM architectures. Current research categorizes discrete speech tokens into two principal classes: acoustic tokens and semantic tokens, each of which has evolved into a rich research domain characterized by unique design philosophies and methodological approaches. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, conducts a critical examination of the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types. Furthermore, we identify persistent challenges in the field and propose potential research directions, aiming to offer actionable insights to inspire future advancements in the development and application of discrete speech tokens.
I've spent some time on generative error correction recently; more numbers and results on it later. Meanwhile, the paper:
https://arxiv.org/abs/2409.09785
Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition
Chao-Han Huck Yang, Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr Żelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Sabato Marco Siniscalchi, Eng Siong Chng, Peter Bell, Catherine Lai, Shinji Watanabe, Andreas Stolcke
Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. These tasks aim to emulate future LLM-based agents handling voice-based interfaces while remaining accessible to a broad audience by utilizing open pretrained language models or agent-based APIs. We also discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.
Another, older paper is here:
https://www.tg-me.com/speechtech/1962
https://github.com/soham97/mellow
https://arxiv.org/abs/2503.08540
Mellow: a small audio language model for reasoning
Soham Deshmukh, Satvik Dixit, Rita Singh, Bhiksha Raj
Multimodal Audio-Language Models (ALMs) can understand and reason over both audio and text. Typically, reasoning performance correlates with model size, with the best results achieved by models exceeding 8 billion parameters. However, no prior work has explored enabling small audio-language models to perform reasoning tasks, despite the potential applications for edge devices. To address this gap, we introduce Mellow, a small Audio-Language Model specifically designed for reasoning. Mellow achieves state-of-the-art performance among existing small audio-language models and surpasses several larger models in reasoning capabilities. For instance, Mellow scores 52.11 on MMAU, comparable to SoTA Qwen2 Audio (which scores 52.5) while using 50 times fewer parameters and being trained on 60 times less data (audio hrs). To train Mellow, we introduce ReasonAQA, a dataset designed to enhance audio-grounded reasoning in models. It consists of a mixture of existing datasets (30% of the data) and synthetically generated data (70%). The synthetic dataset is derived from audio captioning datasets, where Large Language Models (LLMs) generate detailed and multiple-choice questions focusing on audio events, objects, acoustic scenes, signal properties, semantics, and listener emotions. To evaluate Mellow's reasoning ability, we benchmark it on a diverse set of tasks, assessing on both in-distribution and out-of-distribution data, including audio understanding, deductive reasoning, and comparative reasoning. Finally, we conduct extensive ablation studies to explore the impact of projection layer choices, synthetic data generation methods, and language model pretraining on reasoning performance. Our training dataset, findings, and baseline pave the way for developing small ALMs capable of reasoning.
ICASSP 2025 papers are available online now
https://ieeexplore.ieee.org/xpl/conhome/10887540/proceeding?isnumber=10887541
The program website is weird:
https://icassp25.conflux.events/program
Interesting distillation of Kokoro
https://github.com/EndlessReform/smoltts
They also provide a speech dataset encoded with Mimi; LibriTTS-R is just 60 MB.
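For reference, a minimal sketch of encoding and decoding audio with Mimi via the Hugging Face transformers port (kyutai/mimi) is below; smoltts may prepare its dataset with different tooling, so treat this as an assumption about how the discrete tokens are obtained.

```python
# Hedged sketch: turn speech into discrete Mimi tokens (and back) with the
# transformers port of the codec. smoltts may use different tooling; this
# just illustrates why a tokenized corpus ends up so small.
import torch
from datasets import Audio, load_dataset
from transformers import AutoFeatureExtractor, MimiModel

model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# Any mono waveform works; here we resample a dummy LibriSpeech sample.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
audio = ds[0]["audio"]["array"]

inputs = feature_extractor(raw_audio=audio,
                           sampling_rate=feature_extractor.sampling_rate,
                           return_tensors="pt")
with torch.no_grad():
    codes = model.encode(inputs["input_values"]).audio_codes  # discrete tokens
    recon = model.decode(codes).audio_values                  # waveform back

print(codes.shape)  # a few small integer streams per frame instead of raw audio
```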
Some more results from our experiments with GEC using LLMs:
https://alphacephei.com/nsh/2025/03/15/generative-error-correction.html
Most 8B models at 4-bit quantization are not very stable; hallucinations show up in about 25% of cases. Qwen is particularly unstable for this task.
Gemma 2 and Gemma 3 are okay; we have yet to try the 27B version.
The simple prompts from the papers certainly don't work. One has to provide much more detail and list specific issues in the prompt. We still need to work on the prompt more.
Even prompt formatting matters: by modifying the prompt format we were able to reduce WER from 26% to 16% (a rough sketch of this kind of prompting and scoring follows this list).
For now GEC doesn't look like a breakthrough technology; some extra sauce seems to be needed. Simple ROVER is equally good and much more stable.
We discussed on the channel with iLa that an English prompt helps for non-English languages. I think it is possible for some models, but I couldn't confirm it in experiments.
For big models, splitting the input doesn't help much.
There is still a lot of overcorrection of proper names that are rare and unknown to the LLM, and overcorrection of grammar. We need to work more on this.
The difference between Gemma2-9B and Gemini Flash is not very large, except for the number of hallucinations.
Most models have very poor knowledge of rare domains and poor knowledge of speech (phonetics).
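The sketch mentioned above: feed the ASR hypothesis to an instruction-tuned LLM with a restrictive prompt and score with jiwer. The prompt, model choice, and example sentences are illustrative assumptions, not the exact setup from the blog post.

```python
# Illustrative GEC loop: correct an ASR hypothesis with an instruction-tuned
# LLM and compare WER before/after with jiwer. The prompt here is a toy
# example; as noted above, a much more detailed prompt is needed in practice.
import jiwer
from transformers import pipeline

generator = pipeline("text-generation",
                     model="google/gemma-2-9b-it",  # any instruction-tuned model
                     device_map="auto")

reference  = "we will meet at the station at five thirty"
hypothesis = "we will meat at the station at five dirty"

prompt = (
    "You are a post-processor for a speech recognition system.\n"
    "Fix only clear recognition errors in the transcript below.\n"
    "Do not rephrase, do not fix grammar, and do not change proper names\n"
    "you are unsure about. Return only the corrected transcript.\n\n"
    f"Transcript: {hypothesis}\n"
)

out = generator(prompt, max_new_tokens=128, do_sample=False)
corrected = out[0]["generated_text"][len(prompt):].strip()

print("WER before:", jiwer.wer(reference, hypothesis))
print("WER after: ", jiwer.wer(reference, corrected))
```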
Twitter suggested this paper on GEC to me, stressing the named entity recognition issue, right on the subject:
https://arxiv.org/abs/2410.13198
https://x.com/PiotrZelasko/status/1902723841534681357
Canary-1B-Flash and Canary-180M-Flash - two new variants of Canary optimized for fast training and inference.
Key features of Canary-1B-Flash:
* Several times faster!
* More accurate than Canary-1B!
* Word-level timestamps!
* Dropped NC license!
Both models support the same set of languages as the original Canary-1B: English, French, Spanish, and German. A rough usage sketch is below.
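A hedged usage sketch, assuming the Flash variants keep the same NeMo interface as the original Canary-1B; check the nvidia/canary-1b-flash model card for the exact arguments.

```python
# Hedged sketch: plain English transcription with Canary-1B-Flash via NeMo,
# assuming the EncDecMultiTaskModel interface of the original Canary-1B.
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-flash")
model.eval()

# The model card documents further options (source/target language, PnC,
# timestamps) for translation and word-level timestamps.
hypotheses = model.transcribe(["sample_en.wav"], batch_size=1)
print(hypotheses[0])
```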
https://github.com/canopyai/Orpheus-TTS/issues/10#issuecomment-2740645470
christophschuhmann left a comment (canopyai/Orpheus-TTS#10)
Hey, Christoph from Laion here, the guy who made the Laion 5 billion data set. I have been making a voice acting data set with some donations from Intel with altogether 5,000 hours of high quality voice acting.
https://huggingface.co/datasets/laion/laions_got_talent_enhanced_flash_annotations_and_long_captions
https://huggingface.co/datasets/laion/laions_got_talent_raw
I was using HyperLab, which is a reseller for the OpenAI API, so I never actually agreed to the OpenAI terms of service, and then prompted the voice API to role-play like an actor at a casting audition. This way I generated evenly distributed utterances over 40 emotion categories for all 11 OpenAI voices for English, German, French, and Spanish. The data is already online and I also have very detailed emotion captions.

I will make an official release in the next few weeks, but you could already take it and tune German, Spanish, and French models on it. I would be very happy about a capable German model because I want to deploy voice assistants in schools in Germany. I'm doing all of this in my free time; I am still a high school teacher and want to keep it this way.

In the first repository, the quality is the best, but unfortunately I lost the accent labels for the English samples, and some samples in the English part have accents. In the second repository, you find the unenhanced data, which has slightly lower recording quality, but the emotion entry of the JSON contains the corresponding accent. For English, I generated 14 different accents; German, Spanish, and French don't have any accents. Have fun!
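To poke at the data, a hedged sketch with Hugging Face datasets is below; the split name and field layout are assumptions, so check the dataset viewer on the Hub for the real schema (including the emotion/accent entries mentioned above). If the repo turns out to be raw files rather than a datasets-compatible layout, huggingface_hub.snapshot_download is the fallback.

```python
# Hedged sketch: stream a few samples from the LAION voice-acting dataset.
# Split name and keys are assumptions; inspect the Hub viewer before
# relying on specific fields.
from datasets import load_dataset

ds = load_dataset("laion/laions_got_talent_raw", split="train", streaming=True)
for sample in ds.take(3):
    print(sample.keys())  # expect audio plus JSON metadata with emotion/accent info
```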
Not a day without a new TTS.
https://x.com/anuj_diwan/status/1902884487718965330
If you'd like an open-source text-to-speech model that follows your style instructions, consider using our ParaSpeechCaps-based model!
Model: https://huggingface.co/ajd12342/parler-tts-mini-v1-paraspeechcaps
Paper: https://arxiv.org/abs/2503.04713
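A hedged usage sketch following the usual Parler-TTS pattern; the ParaSpeechCaps checkpoint may require the authors' fork of parler_tts and a specific style-description format, so treat the model card as authoritative. The description string here is an invented example.

```python
# Hedged sketch: style-prompted synthesis in the standard Parler-TTS style.
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

repo = "ajd12342/parler-tts-mini-v1-paraspeechcaps"
model = ParlerTTSForConditionalGeneration.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

description = "A female speaker with a husky, animated voice, speaking slowly in a quiet room."
text = "Not a day goes by without a new TTS model."

desc_ids = tokenizer(description, return_tensors="pt").input_ids
prompt_ids = tokenizer(text, return_tensors="pt").input_ids

audio = model.generate(input_ids=desc_ids, prompt_input_ids=prompt_ids)
sf.write("out.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)
```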
It all comes from NLP.
https://github.com/Bartelds/ctc-dro
https://arxiv.org/abs/2502.01777
CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition
Martijn Bartelds, Ananjan Nandi, Moussa Koulako Bala Doumbouya, Dan Jurafsky, Tatsunori Hashimoto, Karen Livescu
Modern deep learning models often achieve high overall performance, but consistently fail on specific subgroups. Group distributionally robust optimization (group DRO) addresses this problem by minimizing the worst-group loss, but it fails when group losses misrepresent performance differences between groups. This is common in domains like speech, where the widely used connectionist temporal classification (CTC) loss scales with input length and varies with linguistic and acoustic properties, leading to spurious differences between group losses. We present CTC-DRO, which addresses the shortcomings of the group DRO objective by smoothing the group weight update to prevent overemphasis on consistently high-loss groups, while using input length-matched batching to mitigate CTC's scaling issues. We evaluate CTC-DRO on the task of multilingual automatic speech recognition (ASR) across five language sets from the ML-SUPERB 2.0 benchmark. CTC-DRO consistently outperforms group DRO and CTC-based baseline models, reducing the worst-language error by up to 47.1% and the average error by up to 32.9%. CTC-DRO can be applied to ASR with minimal computational costs, and offers the potential for reducing group disparities in other domains with similar challenges.
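For background, the exponentiated-gradient group-DRO update that CTC-DRO modifies looks roughly like the sketch below; the smoothed update and length-matched batching are in the paper and repo, so this is not the authors' implementation.

```python
# Background sketch: vanilla group-DRO weight update that CTC-DRO builds on.
# CTC-DRO smooths this update and uses length-matched batches so that CTC
# loss scale differences between languages don't dominate (see paper/repo).
import torch

def group_dro_update(q: torch.Tensor, group_losses: torch.Tensor,
                     eta: float = 0.1) -> torch.Tensor:
    """Multiplicative update of per-group weights from per-group losses."""
    q = q * torch.exp(eta * group_losses)  # upweight currently high-loss groups
    return q / q.sum()                     # renormalize to a distribution

def dro_loss(q: torch.Tensor, group_losses: torch.Tensor) -> torch.Tensor:
    """Weighted training objective for one batch covering several languages."""
    return (q * group_losses).sum()
```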
https://github.com/DataoceanAI/Dolphin
Dolphin is a multilingual, multitask ASR model developed through a collaboration between Dataocean AI and Tsinghua University. It supports 40 Eastern languages across East Asia, South Asia, Southeast Asia, and the Middle East, while also supporting 22 Chinese dialects. It is trained on over 210,000 hours of data, which includes both DataoceanAI's proprietary datasets and open-source datasets. The model can perform speech recognition, voice activity detection (VAD), segmentation, and language identification (LID).
Supports Russian, Uzbek, Kazakh, Tajik, etc.:
https://github.com/DataoceanAI/Dolphin/blob/main/languages.md
https://github.com/gwh22/UniVoice
This work introduces UniVoice, a novel approach that integrates autoregression and flow matching within a transformer-based framework for speech unified understanding and generation. UniVoice is designed to achieve both speech comprehension and generation capabilities through a unified model trained in a single stage. Our experiments demonstrate that UniVoice delivers strong performance for both automatic speech recognition and zero-shot speech synthesis tasks. By combining autoregression and flow matching, UniVoice establishes a foundation for expanding to additional audio understanding and generation tasks using the paradigm in the future.
Announcing the AudioMOS Challenge 2025!
Homepage: https://sites.google.com/view/voicemos-challenge/audiomos-challenge-2025
We are enlarging the scope of the previous VoiceMOS challenge series to cover not only speech but also music and general audio.
Founded in 2022, the VoiceMOS Challenge (VMC) series aims to compare prediction techniques for human ratings of speech. To facilitate development in the automatic evaluation of audio generation systems, we decided to enlarge the scope and rename it as the AudioMOS Challenge.
Track 1: MOS prediction for text-to-music systems
This track is based on the MusicEval dataset, spanning 31 TTM systems, along with ratings collected from music experts. Evaluation was conducted across two axes: overall musical impression and alignment with the text prompt.
Track 2: Audiobox-aesthetics-style prediction for TTS, TTA and TTM samples
This track is based on the recently released Meta Audiobox Aesthetics, where they proposed four new axes: production quality, production complexity, content enjoyment, and content usefulness.
Track 3: MOS prediction for speech in high sampling frequencies
For the training set, we provide samples at 16/24/48 kHz, and during evaluation the participants are asked to evaluate samples so that their scores reflect a listening test that contains samples at all frequencies.
We are planning to submit a challenge proposal to ASRU2025. The challenge will start officially on April 9th. Please pre-register if interested!
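As a side note (not part of the announcement): MOS-prediction tracks are usually scored with MSE and correlation metrics at the utterance and system level; a minimal sketch with numpy/scipy, using toy numbers:

```python
# Generic MOS-prediction scoring sketch (MSE, LCC, SRCC, KTAU); toy values.
import numpy as np
from scipy import stats

true_mos = np.array([3.2, 4.1, 2.8, 3.9])  # listening-test ratings
pred_mos = np.array([3.0, 4.3, 2.9, 3.7])  # model predictions

mse = float(np.mean((true_mos - pred_mos) ** 2))
lcc = float(np.corrcoef(true_mos, pred_mos)[0, 1])  # linear correlation
srcc, _ = stats.spearmanr(true_mos, pred_mos)       # rank correlation
ktau, _ = stats.kendalltau(true_mos, pred_mos)
print(f"MSE={mse:.3f} LCC={lcc:.3f} SRCC={srcc:.3f} KTAU={ktau:.3f}")
```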