The paper itself

Using Voicebox-based Synthetic Speech for ASR Adaptation

https://www.isca-archive.org/syndata4genai_2024/dhamyal24_syndata4genai.pdf
https://arxiv.org/abs/2411.18803

TS3-Codec: Transformer-Based Simple Streaming Single Codec

Haibin Wu, Naoyuki Kanda, Sefik Emre Eskimez, Jinyu Li

Neural audio codecs (NACs) have garnered significant attention as key technologies for audio compression as well as audio representation for speech language models. While mainstream NAC models are predominantly convolution-based, the performance of NACs with a purely transformer-based, and convolution-free architecture remains unexplored. This paper introduces TS3-Codec, a Transformer-Based Simple Streaming Single Codec. TS3-Codec consists of only a stack of transformer layers with a few linear layers, offering greater simplicity and expressiveness by fully eliminating convolution layers that require careful hyperparameter tuning and large computations. Under the streaming setup, the proposed TS3-Codec achieves comparable or superior performance compared to the codec with state-of-the-art convolution-based architecture while requiring only 12% of the computation and 77% of bitrate. Furthermore, it significantly outperforms the convolution-based codec when using similar computational resources.
So Kyutai released a modular system, https://unmute.sh, basically admitting that their first demo is not really usable.
Supports many languages, including Russian. No code/model yet though.

https://funaudiollm.github.io/cosyvoice3/

https://arxiv.org/abs/2505.17589

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Xian Shi, Keyu An, Guanrou Yang, Yabin Li, Yanni Chen, Zhifu Gao, Qian Chen, Yue Gu, Mengzhe Chen, Yafeng Chen, Shiliang Zhang, Wen Wang, Jieping Ye

In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at this https URL.
Interesting point: we shouldn't test speech LLMs on factual knowledge.

https://www.youtube.com/watch?v=2d1MU280yQk

https://github.com/slp-rl/WhiStress

https://arxiv.org/abs/2505.19103

WHISTRESS: Enriching Transcriptions with Sentence Stress Detection

Iddo Yosha, Dorin Shteyman, Yossi Adi

Spoken language conveys meaning not only through words but also through intonation, emotion, and emphasis. Sentence stress, the emphasis placed on specific words within a sentence, is crucial for conveying speaker intent and has been extensively studied in linguistics. In this work, we introduce WHISTRESS, an alignment-free approach for enhancing transcription systems with sentence stress detection. To support this task, we propose TINYSTRESS-15K, a scalable, synthetic training data for the task of sentence stress detection which resulted from a fully automated dataset creation process. We train WHISTRESS on TINYSTRESS-15K and evaluate it against several competitive baselines. Our results show that WHISTRESS outperforms existing methods while requiring no additional input priors during training or inference. Notably, despite being trained on synthetic data, WHISTRESS demonstrates strong zero-shot generalization across diverse benchmarks. Project page: this https URL.
Runtime quality control is interesting.

https://www.linkedin.com/posts/yongyi-zang_github-resemble-aichatterbox-sota-open-source-activity-7333625257456480256-XfT8

Got very curious, so I started to look into the source code of Resemble AI's newly released open TTS model Chatterbox (https://lnkd.in/gzmCFFaQ), which claims to outperform ElevenLabs. Here's a (too quick, hopefully not wrong) tech deep-ish dive into its architecture:

High-level overview: text -> semantic tokens -> flow matching for Mel -> Mel to waveform. For voice conversion, speech -> semantic tokens, then the rest of the same pipeline. Speaker embedding conditioning is applied to both text -> semantic tokens and semantic tokens -> Mel.
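To make that data flow concrete, here is a rough pseudo-pipeline under my reading of the post; every function name below is a hypothetical placeholder, not Chatterbox's actual API:

```python
# Hypothetical placeholders standing in for the Chatterbox components described in this post.
from typing import Any

def speaker_embed(audio: Any) -> Any: ...              # speaker embedding network(s)
def text_to_semantic(text: str, spk: Any) -> Any: ...  # llama-style sequence model
def audio_to_semantic(audio: Any) -> Any: ...          # S3-style semantic tokenizer
def semantic_to_mel(sem: Any, spk: Any) -> Any: ...    # flow-matching Mel generator
def mel_to_wav(mel: Any) -> Any: ...                   # HiFTNet-style vocoder

def tts(text: str, reference_audio: Any) -> Any:
    """text -> semantic tokens -> Mel -> waveform, conditioned on the reference speaker."""
    spk = speaker_embed(reference_audio)
    return mel_to_wav(semantic_to_mel(text_to_semantic(text, spk), spk))

def voice_conversion(source_audio: Any, reference_audio: Any) -> Any:
    """speech -> semantic tokens, then the same Mel and waveform stages."""
    spk = speaker_embed(reference_audio)
    return mel_to_wav(semantic_to_mel(audio_to_semantic(source_audio), spk))
```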

Given a reference audio, it is encoded by a speaker embedding network A (more on that in a second) and by a speech tokenizer. The speech tokenizer, S3Tokenizer, looks very much like CosyVoice's with some twists (its tokens are basically just very semantic because it is trained on an ASR objective).

These two embeddings are then sent into its core sequence-modeling model (a llama). CFG is applied by running the model twice within the same batch, once without the conditioning speaker embedding. The two runs are then averaged with a weight to form the final logits. The text tokens and speech tokens have separate absolute positional embeddings. (Why not RoPE, BTW?)
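A minimal sketch of that batched CFG step, assuming a model call with a hypothetical `speaker=` argument; the weighting scheme here is illustrative, not the exact one in the repo:

```python
import torch

def cfg_logits(model, tokens, speaker_emb, cfg_weight=0.5):
    """Run conditioned and unconditioned rows in one batch and blend their logits."""
    cond = speaker_emb.unsqueeze(0)                   # (1, D) speaker conditioning
    uncond = torch.zeros_like(cond)                   # null conditioning for the second row
    batch_tokens = tokens.unsqueeze(0).repeat(2, 1)   # (2, T) same token sequence twice
    batch_spk = torch.cat([cond, uncond], dim=0)      # (2, D)

    logits = model(batch_tokens, speaker=batch_spk)   # (2, T, V); hypothetical signature
    cond_logits, uncond_logits = logits[0], logits[1]

    # Weighted average of the two runs forms the final logits.
    return cfg_weight * cond_logits + (1.0 - cfg_weight) * uncond_logits
```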

A runtime quality control model inspects alignment by looking at the attention map of layer 9 of the sequence model during generation (this is genius actually, they actually went in and looked at attention maps!) and makes sure the generated tokens attend to the right words (the final token is not attended to for too long, previous tokens are not attended to again). It can't really fix anything, but it can force an immediate end of speech by setting the <EOS> token's logit to ... 2^15...
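A hedged sketch of that runtime check; the attention-map shape, the stall/backtrack heuristics, and the token id are my assumptions, only the 2^15 override comes from the post:

```python
import torch

EOS_ID = 0             # placeholder token id
FORCE_LOGIT = 2.0**15  # value the post mentions being written into the <EOS> position

def alignment_guard(attn_map, logits, text_len, max_tail_steps=8):
    """attn_map: (generated_steps, text_len), e.g. layer-9 attention averaged over heads."""
    attended = attn_map.argmax(dim=-1)                  # text position each step attends to
    tail = attended[-max_tail_steps:]

    stuck_on_last = bool((tail == text_len - 1).all())  # lingering on the final word too long
    went_backwards = bool((attended.diff() < 0).any())  # re-attending to earlier words

    if stuck_on_last or went_backwards:
        logits[..., EOS_ID] = FORCE_LOGIT               # force generation to end immediately
    return logits
```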

So the speech tokens are generated by the sequence model, and a speaker embedding network B is used to extract the speaker embedding again and use it as the condition for the vocoder. Remember speaker embedding network A? That's GE2E (so basically 3 LSTMs). Network B is a CAM++ x-vector; if they didn't change it, then it is likely a model pre-trained with an AAM-Softmax objective. Why two embedding networks? I can't really figure this one out.

So the vocoder is a two-stage system: first generate a Mel spectrogram, then generate a waveform from the Mel. The Mel generation part is identical to CosyVoice, where a CFM model is used; they used HiFTNet instead of HiFiGAN as the waveform vocoder, though.

HiFTNet is a neural source-filter network with several stages. It starts by predicting F0 from the Mel, then uses these F0s to generate a source signal. The source signal then conditions an inverse-STFT network (the *neural filter*) that predicts frame-wise magnitude and phase.
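A minimal source-filter sketch in the spirit of that description; the hop size, sample rate, and sine-plus-noise source are assumptions, and in the real model the magnitude/phase come from learned predictors conditioned on the Mel and the source:

```python
import torch

def f0_to_source(f0_hz, hop=256, sr=24000):
    """Upsample frame-level F0 to sample rate and synthesize a sine source (NSF-style)."""
    f0 = f0_hz.repeat_interleave(hop)                      # (frames * hop,)
    phase = 2 * torch.pi * torch.cumsum(f0 / sr, dim=0)
    voiced = (f0 > 0).float()
    return voiced * torch.sin(phase) + 0.003 * torch.randn_like(phase)

def istft_filter(mag, phase, n_fft=1024, hop=256):
    """Render a waveform from predicted frame-wise magnitude and phase via inverse STFT."""
    spec = mag * torch.exp(1j * phase)                     # (n_fft // 2 + 1, frames) complex
    return torch.istft(spec, n_fft=n_fft, hop_length=hop,
                       window=torch.hann_window(n_fft))

# Wire the stages together with dummy tensors just to show the data flow.
frames = 50
f0 = torch.full((frames,), 120.0)        # a flat 120 Hz contour, all voiced
source = f0_to_source(f0)                # would condition the neural filter
mag = torch.rand(513, frames)
phase = torch.rand(513, frames) * 2 * torch.pi
audio = istft_filter(mag, phase)
```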

When the sequence modeling part is skipped, the speech tokens can also be extracted directly from the source speech and then passed through the decoder network.

The watermarking model is a separate model that runs at the waveform level. It adds a watermark into the magnitude spectrogram, then uses 3 branches of convolutions to model watermark presence at different time scales.
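A hedged sketch of what a three-branch, multi-scale presence detector over the magnitude spectrogram could look like; kernel sizes, channel counts, and the score averaging are my guesses, not the actual watermarker's architecture:

```python
import torch
import torch.nn as nn

class MultiScaleWatermarkDetector(nn.Module):
    """Three conv branches with different kernel sizes over the magnitude spectrogram,
    each scoring per-frame watermark presence at a different time scale."""
    def __init__(self, n_bins=513):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(n_bins, 64, kernel_size=k, padding=k // 2), nn.ReLU(),
                nn.Conv1d(64, 1, kernel_size=1),
            )
            for k in (3, 9, 27)          # short / medium / long temporal context
        ])

    def forward(self, mag):              # mag: (batch, n_bins, frames)
        scores = [torch.sigmoid(branch(mag)) for branch in self.branches]
        return torch.stack(scores, dim=0).mean(dim=0)    # average across the three scales

detector = MultiScaleWatermarkDetector()
print(detector(torch.rand(1, 513, 100)).shape)           # torch.Size([1, 1, 100])
```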
We speech guys rarely eat our own dogfood ;) A few interesting dictation tools are popular these days:

1) Willowvoice
2) Wisprflow
3) Superwhisper

It is interesting how LLM correction comes into play here: you don't get a plain transcript; instead, the tool converts your input into the required style (a rough sketch of the idea follows the links below).

https://willowvoice.com/

https://wisprflow.ai/

https://superwhisper.com/
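Here is the promised rough sketch of the LLM-correction step, assuming an OpenAI-compatible client; the model name and prompt are illustrative, not what any of these tools actually use:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def style_transcript(raw_transcript: str,
                     style: str = "a concise, well-punctuated email") -> str:
    """Rewrite a raw dictation transcript into the style the user asked for."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": f"Rewrite the dictated text as {style}. "
                        "Remove disfluencies, fix punctuation, keep the meaning."},
            {"role": "user", "content": raw_transcript},
        ],
    )
    return response.choices[0].message.content

print(style_transcript("um so can you uh send me the report by friday thanks"))
```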
As part of our mission to create open-source datasets for low-resource African languages, Digital Umuganda has released 2,250 hours of open-source Kinyarwanda speech data. To accompany this release, we launched an ASR hackathon on Kaggle, inviting the ecosystem to build models and contribute to shaping the future of low resource language technologies.

Our goal is to collect 10,000 hours each of Kinyarwanda and Swahili speech data. This hackathon is a crucial step in that journey. The feedback will help us refine our data collection strategy for the remaining hours and ensure the datasets meet the needs of developers, researchers, and language advocates across the region.

We would greatly appreciate it if you could share this initiative with your network and help us reach more contributors passionate about language, technology, and open data.

The hackathon is made up of 3 tracks:

Track A – Small: 540 hours of fully transcribed Kinyarwanda speech.

Track B – Medium: 1180 hours of fully transcribed Kinyarwanda speech.

Track C – Large: 1180 hours of transcribed speech plus 1170 hours of unlabeled Kinyarwanda audio.

For more information you can check the Hackathon website https://digital-umuganda.github.io/kasr_hackathon/
https://www.linkedin.com/posts/alexander-polok-b5567284_dicow-diarization-conditioned-whisper-for-activity-7341058825732415488-UVez

We are happy to announce that our DiCoW and DiariZen based system finished 🥈 in the Challenge and Workshop on Multilingual Conversational Speech Language Model (MLC-SLM) at Interspeech 2025, organized by Nexdata.jp (formerly Datatang Co., Ltd.)!

https://www.nexdata.ai/competition/mlc-slm

📄 System description and additional analysis (including dataset inconsistencies) are now available on arXiv:
BUT System for the MLC-SLM Challenge:
👉 https://www.arxiv.org/abs/2506.13414

In addition, I’m very pleased to share that our journal paper:
"DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition"
👉https://www.sciencedirect.com/science/article/abs/pii/S088523082500066X
has been accepted for publication in Computer Speech & Language (Elsevier)!

And last but not least — just yesterday I had the pleasure of presenting a tutorial and mini-challenge on fine-tuning DiCoW in data/compute constrained environments at this year's JSALT Summer School!
🎓 If you want to try it yourself: https://colab.research.google.com/github/Lakoc/JSALT_tutorial/blob/main/challenge.ipynb

https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard

🎥 Recording available here: https://www.youtube.com/watch?v=KqNKGjcsi9g&list=PLSeS0sl8xpTwz7h5iJSniiF89iUdZXNJ2&index=28
LAION proudly presents 2 state-of-the-art emotion detection models for voice and face, surpassing Gemini 2.5 Pro and Hume API. They are completely open under a CC BY 4.0 license, alongside a ~5,000-hour voice-acting dataset & 2 expert-annotated benchmarks.

https://huggingface.co/laion/BUD-E-Whisper

https://arxiv.org/abs/2505.20033
https://github.com/fluxions-ai/vui
https://huggingface.co/fluxions/vui

Got some attention recently. A multispeaker TTS model with context (like Dia), 100M params.

Dia vs vui

- vui is 16x smaller
- Unlimited render length; Dia is limited to 30 seconds
- vui has 150 ms latency (time to first byte)
- vui runs in <5 GB VRAM
- 4x faster codec
- 1/2 the number of people
- Built with Google Cloud TPUs vs two 4090s in a basement
- 7x faster RTF
We track Inworld's company status as it was founded by the Dialogflow guys (Dialogflow was very popular back in the day). Interesting that AI for games didn't work out.

https://www.linkedin.com/posts/kylangibbs_inworld-is-evolving-1-we-just-published-activity-7341215644828188672-aYWf

Inworld is evolving.

1. We just published our vision of the future. These are distilled learnings based on our first 4 years engaged with partners like NVIDIA, Microsoft, Status, Little Umbrella, Streamlabs, Nanobit, NBCUniversal, Mistral AI, Google and thousands of other developers.

Due to explosive growth in demand we are widening our focus to broader consumer applications (extending from games into new areas like fitness, learning and social connection). We are seeing new and existing companies across consumer categories shift the focus of AI adoption from cost savings to net new revenue opportunities through novel AI-native applications, and we are leaning in to support that shift.
Tested the https://huggingface.co/kyutai/stt-1b-en_fr model on some diverse data. Accuracy is on the lower side.

CMU Kids WER is 11.3, for example, compared to 4.8 for parakeet-tdt-0.6b-v2. LibriSpeech test-clean WER is 4+ too.

The output is sometimes Chinese, sometimes Arabic.
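For reference, a minimal sketch of how such WER numbers can be reproduced with jiwer, assuming a simple lowercase/punctuation normalization (not necessarily the exact setup behind the figures above):

```python
import jiwer

def normalize(text: str) -> str:
    """Lowercase and strip punctuation before scoring."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).strip()

references = ["The quick brown fox jumps over the lazy dog."]   # ground-truth transcripts
hypotheses = ["the quick brown fox jumped over a lazy dog"]     # model outputs

wer = jiwer.wer([normalize(r) for r in references],
                [normalize(h) for h in hypotheses])
print(f"WER: {wer:.3f}")
```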