We tried the discrete duration loss from StyleTTS2 in Matcha-TTS, and it works really well.
https://alphacephei.com/nsh/2025/01/12/discrete-units.html
Why discrete units
Discrete units have made a splash since HuBERT, probably (2021, four years ago already), and then with Tortoise TTS and its successors. There were many attempts before that too, like the very old system by our respected colleagues Jan Cernocky, Genevieve Baudoin, Gerard Chollet…
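For reference, a minimal sketch of how a classification-style ("discrete") duration loss can be wired into a Matcha-TTS-like duration predictor. The bin count and layer sizes are illustrative assumptions, and the actual StyleTTS2 formulation differs in details.

```python
# A sketch, not the StyleTTS2 implementation: per-phone durations are treated
# as class labels (number of frames) and trained with cross-entropy instead
# of regressing log-durations with MSE.
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_FRAMES = 50  # assumed cap on per-phone duration, in mel frames

class DiscreteDurationPredictor(nn.Module):
    def __init__(self, hidden_dim: int = 192):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, MAX_FRAMES + 1),  # one class per duration value
        )

    def forward(self, phone_states: torch.Tensor) -> torch.Tensor:
        # phone_states: (batch, num_phones, hidden_dim) -> duration logits
        return self.proj(phone_states)

def discrete_duration_loss(logits: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    # durations: ground-truth frame counts per phone, shape (batch, num_phones)
    targets = durations.clamp(max=MAX_FRAMES).long()
    return F.cross_entropy(logits.reshape(-1, MAX_FRAMES + 1), targets.reshape(-1))
```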
Guided sampling helps reduce artifacts and improve clarity, but it also significantly reduces expressiveness. However, one can see that simply reducing the temperature has a similar effect with less compute.
https://alphacephei.com/nsh/2025/01/17/guidance.html
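A rough sketch of the two knobs being compared, assuming a flow-matching decoder (as in Matcha-TTS) that exposes a conditional velocity estimate; `velocity_fn` and the conditioning tensors below are placeholders, not a real API.

```python
# Sketch only: classifier-free guidance vs. simple temperature scaling for a
# flow-matching / diffusion-style TTS decoder.
import torch

def guided_velocity(velocity_fn, x, t, cond, uncond, guidance_scale: float = 2.0):
    # Guidance needs two forward passes per ODE step (conditional and
    # unconditional) and extrapolates toward the conditional direction.
    v_cond = velocity_fn(x, t, cond)
    v_uncond = velocity_fn(x, t, uncond)
    return v_uncond + guidance_scale * (v_cond - v_uncond)

def initial_noise(shape, temperature: float = 0.7, device: str = "cpu"):
    # Temperature is a single multiplication on the starting noise, so it is
    # much cheaper than guidance while also trading expressiveness for stability.
    return torch.randn(shape, device=device) * temperature
```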
Exactly 20 years ago we started our first project in speech, a voice for Festival TTS. Many things have happened since then, but it has been a great story. Looking forward to the next 20 years now.
https://www.linux.org.ru/news/linux-general/775065?cid=776417
Talk to me!
A note on how to teach Linux to speak with a human-like voice, using the Sphinx-2 speech recognition system.
Once again (for the third time): https://github.com/KdaiP/StableTTS is really good.
It is all about conditioning. Many words in the paper, but this picture is the main one.
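To illustrate what "conditioning" usually means in this context, here is a generic FiLM-style conditioning block, where a speaker/prosody vector modulates decoder states with a learned scale and shift. This is a sketch of the general idea, not the actual StableTTS code.

```python
# Generic FiLM-style conditioning sketch; module names and dimensions are
# illustrative, not taken from StableTTS.
import torch
import torch.nn as nn

class FiLMConditioning(nn.Module):
    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, hidden: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, time, hidden_dim), cond: (batch, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return hidden * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```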
https://github.com/FireRedTeam/FireRedASR
FireRedASR is a family of large-scale automatic speech recognition (ASR) models supporting Mandarin, Chinese dialects and English, while also offering singing lyrics recognition capability, achieving a new state-of-the-art on public Mandarin ASR benchmarks.
FireRedASR is designed to meet diverse requirements in superior performance and optimal efficiency across various applications. It comprises two variants:
FireRedASR-LLM: Designed to achieve state-of-the-art (SOTA) performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model (LLM) capabilities.
FireRedASR-AED: Designed to balance high performance and computational efficiency and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture.
https://arxiv.org/pdf/2501.14350
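A rough sketch of the Encoder-Adapter-LLM pattern mentioned above: a speech encoder produces frame embeddings, a small adapter downsamples and projects them into the LLM's embedding space, and the LLM decodes text. All module names and dimensions here are assumptions for illustration, not the FireRedASR code.

```python
# Conceptual Encoder-Adapter-LLM sketch; placeholders only.
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Downsample encoder frames and project them into the LLM embedding space."""

    def __init__(self, enc_dim: int = 512, llm_dim: int = 4096, stride: int = 4):
        super().__init__()
        self.downsample = nn.Conv1d(enc_dim, enc_dim, kernel_size=stride, stride=stride)
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, enc_frames: torch.Tensor) -> torch.Tensor:
        # enc_frames: (batch, time, enc_dim)
        x = self.downsample(enc_frames.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)  # (batch, time // stride, llm_dim)

# Usage idea: prepend the adapted speech embeddings to the text prompt
# embeddings and let the (frozen or lightly tuned) LLM generate the transcript.
```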
Great speeds
https://arxiv.org/abs/2406.08835
EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed
Ziyang Zhuang, Chenfeng Miao, Kun Zou, Ming Fang, Tao Wei, Zijian Li, Ning Cheng, Wei Hu, Shaojun Wang, Jing Xiao
Non-autoregressive (NAR) automatic speech recognition (ASR) models predict tokens independently and simultaneously, bringing high inference speed. However, there is still a gap in the accuracy of the NAR models compared to the autoregressive (AR) models. In this paper, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EffectiveASR. It uses an Index Mapping Vector (IMV) based alignment generator to generate alignments during training, and an alignment predictor to learn the alignments for inference. It can be trained end-to-end (E2E) with cross-entropy loss combined with alignment loss. The proposed EffectiveASR achieves competitive results on the AISHELL-1 and AISHELL-2 Mandarin benchmarks compared to the leading models. Specifically, it achieves character error rates (CER) of 4.26%/4.62% on the AISHELL-1 dev/test dataset, which outperforms the AR Conformer with about 30x inference speedup.
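The speedup comes from predicting all tokens in one parallel pass instead of one decoder call per token. A toy contrast below; the paper's IMV-based alignment mechanism is not reproduced here, the output length is simply assumed to be fixed by an alignment predictor.

```python
# Toy illustration of single-step NAR decoding vs. an autoregressive loop.
import torch

def nar_decode(token_logits: torch.Tensor) -> torch.Tensor:
    # token_logits: (batch, max_tokens, vocab), produced in one forward pass
    # after the alignment predictor has fixed the output length.
    return token_logits.argmax(dim=-1)  # all positions decoded at once

def ar_decode(step_fn, bos_id: int, eos_id: int, max_len: int = 100) -> list:
    # By contrast, an autoregressive decoder calls the model once per token.
    tokens = [bos_id]
    for _ in range(max_len):
        next_id = int(step_fn(torch.tensor([tokens])).argmax(dim=-1)[0, -1])
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens[1:]
```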
https://arxiv.org/abs/2502.05232
Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers
Adam Stooke, Rohit Prabhavalkar, Khe Chai Sim, Pedro Moreno Mengibar
Modern systems for automatic speech recognition, including the RNN-Transducer and Attention-based Encoder-Decoder (AED), are designed so that the encoder is not required to alter the time-position of information from the audio sequence into the embedding; alignment to the final text output is processed during decoding. We discover that the transformer-based encoder adopted in recent years is actually capable of performing the alignment internally during the forward pass, prior to decoding. This new phenomenon enables a simpler and more efficient model, the "Aligner-Encoder". To train it, we discard the dynamic programming of RNN-T in favor of the frame-wise cross-entropy loss of AED, while the decoder employs the lighter text-only recurrence of RNN-T without learned cross-attention -- it simply scans embedding frames in order from the beginning, producing one token each until predicting the end-of-message. We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. In a representative comparison, we measure the total inference time for our model to be 2x faster than RNN-T and 16x faster than AED. Lastly, we find that the audio-text alignment is clearly visible in the self-attention weights of a certain layer, which could be said to perform "self-transduction".
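The decoding procedure described in the abstract is simple enough to sketch: the decoder walks over the (already aligned) encoder frames in order, emitting one token per frame until it predicts end-of-message. A toy version, with `decoder_step` as a placeholder for the light text-only recurrence:

```python
# Toy sketch of the Aligner-Encoder decoding loop from the abstract; not the
# authors' code. decoder_step(frame, prev_token, state) is assumed to return
# (vocab_logits, new_state).
def aligner_encoder_decode(encoder_frames, decoder_step, bos_id: int, eos_id: int):
    # encoder_frames: sequence of per-frame embeddings for one utterance
    tokens, prev, state = [], bos_id, None
    for frame in encoder_frames:        # scan frames in order, no cross-attention
        logits, state = decoder_step(frame, prev, state)
        prev = int(logits.argmax())
        if prev == eos_id:              # stop at end-of-message
            break
        tokens.append(prev)
    return tokens
```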
This thing was recently released; somehow I missed it before.
Sortformer diarizer: an open-source, end-to-end neural model for speaker diarization.
- Integration with ASR and LLM Models: Sortformer is designed to be integrated with ASR or LLM models as a Transformer Encoder. It can be used to inject token-level speaker ID info into the encoder parts of ASR models and LLMs.
- Train/Fine-tune via Token-level Labels: Sortformer resolves the permutation problem using arrival-time sort-loss-based training, enabling speaker IDs for words to be trained via token-level labels. No more timestamp-based training for speaker diarization!
https://arxiv.org/abs/2409.06656
https://huggingface.co/nvidia/diar_sortformer_4spk-v1
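The core trick is the arrival-time sort loss: order the target speakers by when they first speak, then use a plain frame-wise loss with no permutation search as in PIT. A sketch of that idea (an illustration of the concept, not the NeMo implementation):

```python
# Sort-loss sketch: sort target speaker tracks by arrival time, then BCE.
import torch
import torch.nn.functional as F

def sort_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # pred: (frames, max_speakers) sigmoid outputs; target: same shape, 0/1 labels
    target = target.float()
    num_frames = target.shape[0]
    first_active = target.argmax(dim=0).float()                # first frame each speaker is active
    first_active[target.sum(dim=0) == 0] = float(num_frames)   # silent speakers sort last
    order = torch.argsort(first_active)                        # arrival-time order
    return F.binary_cross_entropy(pred, target[:, order])
```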
https://huggingface.co/datasets/KBLab/rixvox-v2
23k hours of Swedish speech. These guys also release Whisper fine-tunes:
https://huggingface.co/KBLab
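A quick way to poke at the corpus without downloading it, assuming a standard `train` split; the column names are not assumed here, we just print what one example contains.

```python
# Stream one example from the dataset to inspect its fields.
from datasets import load_dataset

ds = load_dataset("KBLab/rixvox-v2", split="train", streaming=True)
example = next(iter(ds))
print(example.keys())
```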
KAD: No More FAD! An Effective and Efficient Evaluation Metric for Audio Generation
https://arxiv.org/abs/2502.15602
Although being widely adopted for evaluating generated audio signals, the Fréchet Audio Distance (FAD) suffers from significant limitations, including reliance on Gaussian assumptions, sensitivity to sample size, and high computational complexity. As an alternative, we introduce the Kernel Audio Distance (KAD), a novel, distribution-free, unbiased, and computationally efficient metric based on Maximum Mean Discrepancy (MMD). Through analysis and empirical validation, we demonstrate KAD's advantages: (1) faster convergence with smaller sample sizes, enabling reliable evaluation with limited data; (2) lower computational cost, with scalable GPU acceleration; and (3) stronger alignment with human perceptual judgments. By leveraging advanced embeddings and characteristic kernels, KAD captures nuanced differences between real and generated audio. Open-sourced in the kadtk toolkit, KAD provides an efficient, reliable, and perceptually aligned benchmark for evaluating generative audio models.
https://github.com/YoonjinXD/kadtk
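For intuition, the metric builds on the standard unbiased squared-MMD estimator computed over audio embeddings. A sketch with a Gaussian kernel is below; the kernel choice and bandwidth here are assumptions, and the kadtk toolkit should be used for real evaluations.

```python
# Unbiased MMD^2 estimator over embedding sets, Gaussian kernel.
import torch

def gaussian_kernel(a: torch.Tensor, b: torch.Tensor, bandwidth: float = 10.0) -> torch.Tensor:
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-d2 / (2 * bandwidth ** 2))

def mmd2_unbiased(x: torch.Tensor, y: torch.Tensor, bandwidth: float = 10.0) -> torch.Tensor:
    # x: (m, d) embeddings of reference audio, y: (n, d) embeddings of generated audio
    m, n = x.shape[0], y.shape[0]
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    # Drop diagonal terms for the unbiased estimate
    term_x = (k_xx.sum() - k_xx.diag().sum()) / (m * (m - 1))
    term_y = (k_yy.sum() - k_yy.diag().sum()) / (n * (n - 1))
    return term_x + term_y - 2 * k_xy.mean()
```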
We all know that distillation is more efficient than training from scratch, so the paper is not very insightful, but it is interesting to see where this is all going.
https://pages.cs.huji.ac.il/adiyoss-lab/slamming/
https://arxiv.org/abs/2502.15814
Slamming: Training a Speech Language Model on One GPU in a Day
Gallil Maimon, Avishai Elmakies, Yossi Adi
We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data and tweaking all other components.
...
Introducing Emilia-Large: 200K+ Hours of Open-Source Speech Data!
We're excited to release Emilia-Large, the largest open TTS pretraining dataset: 200K+ hours of multilingual speech data, fully open-source. It is ready to use for #TTS and #SpeechLM.
https://x.com/realamphion/status/1894719602816393295
Multimodal LLM Phi-4 from Microsoft, good benchmarks on speech
https://huggingface.co/microsoft/Phi-4-multimodal-instruct
No paper yet but samples sound nice
https://sparkaudio.github.io/spark-tts/
Spark-TTS, a novel system built upon our proposed BiCodec, a single-stream speech codec that strategically decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker-specific attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained attribute control (e.g., gender, pitch level) and fine-grained parameter adjustment (e.g., precise pitch values, speaking rate).
A somewhat interesting, in-depth piece on optimizing Whisper. Another bit of whisperology.
https://github.com/efeslab/LiteASR
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation
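The core idea, as the title says, is low-rank approximation of the model's linear layers. A generic truncated-SVD sketch of that idea is below; LiteASR itself uses activation-aware approximation and other tricks, so treat this only as the basic mechanism.

```python
# Generic low-rank compression of a linear layer via truncated SVD.
import torch
import torch.nn as nn

def low_rank_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    # Factor W (out, in) into B (out, rank) @ A (rank, in): two thin layers
    # replace one big one, cutting parameters and FLOPs when rank is small.
    u, s, vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    a = nn.Linear(layer.in_features, rank, bias=False)
    b = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    a.weight.data = (torch.diag(s[:rank]) @ vh[:rank]).contiguous()
    b.weight.data = u[:, :rank].contiguous()
    if layer.bias is not None:
        b.bias.data = layer.bias.data.clone()
    return nn.Sequential(a, b)

# Example: compress a 768x768 projection to rank 128
compressed = low_rank_linear(nn.Linear(768, 768), rank=128)
```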