The paper itself

Using Voicebox-based Synthetic Speech for ASR Adaptation

https://www.isca-archive.org/syndata4genai_2024/dhamyal24_syndata4genai.pdf
https://arxiv.org/abs/2411.18803

TS3-Codec: Transformer-Based Simple Streaming Single Codec

Haibin Wu, Naoyuki Kanda, Sefik Emre Eskimez, Jinyu Li

Neural audio codecs (NACs) have garnered significant attention as key technologies for audio compression as well as audio representation for speech language models. While mainstream NAC models are predominantly convolution-based, the performance of NACs with a purely transformer-based, and convolution-free architecture remains unexplored. This paper introduces TS3-Codec, a Transformer-Based Simple Streaming Single Codec. TS3-Codec consists of only a stack of transformer layers with a few linear layers, offering greater simplicity and expressiveness by fully eliminating convolution layers that require careful hyperparameter tuning and large computations. Under the streaming setup, the proposed TS3-Codec achieves comparable or superior performance compared to the codec with state-of-the-art convolution-based architecture while requiring only 12% of the computation and 77% of bitrate. Furthermore, it significantly outperforms the convolution-based codec when using similar computational resources.
So Kyutai released a modular system, https://unmute.sh, basically admitting that their first demo is not really usable.
Supports many languages, including Russian. No code/model yet though.

https://funaudiollm.github.io/cosyvoice3/

https://arxiv.org/abs/2505.17589

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Xian Shi, Keyu An, Guanrou Yang, Yabin Li, Yanni Chen, Zhifu Gao, Qian Chen, Yue Gu, Mengzhe Chen, Yafeng Chen, Shiliang Zhang, Wen Wang, Jieping Ye

In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at this https URL.
Interesting point: we shouldn't test speech LLMs on factual knowledge.

https://www.youtube.com/watch?v=2d1MU280yQk

https://github.com/slp-rl/WhiStress

https://arxiv.org/abs/2505.19103

WHISTRESS: Enriching Transcriptions with Sentence Stress Detection

Iddo Yosha, Dorin Shteyman, Yossi Adi

Spoken language conveys meaning not only through words but also through intonation, emotion, and emphasis. Sentence stress, the emphasis placed on specific words within a sentence, is crucial for conveying speaker intent and has been extensively studied in linguistics. In this work, we introduce WHISTRESS, an alignment-free approach for enhancing transcription systems with sentence stress detection. To support this task, we propose TINYSTRESS-15K, a scalable, synthetic training data for the task of sentence stress detection which resulted from a fully automated dataset creation process. We train WHISTRESS on TINYSTRESS-15K and evaluate it against several competitive baselines. Our results show that WHISTRESS outperforms existing methods while requiring no additional input priors during training or inference. Notably, despite being trained on synthetic data, WHISTRESS demonstrates strong zero-shot generalization across diverse benchmarks. Project page: this https URL.
Runtime quality control is interesting.

https://www.linkedin.com/posts/yongyi-zang_github-resemble-aichatterbox-sota-open-source-activity-7333625257456480256-XfT8

Got very curious, so I started to look into the source code of Resemble AI's newly released open TTS model Chatterbox (https://lnkd.in/gzmCFFaQ), which claims to outperform ElevenLabs. Here's a (too quick, hopefully not wrong) tech deep-ish dive into its architecture:

High-level overview: text -> semantic tokens -> flow matching for Mel -> Mel to waveform. For voice conversion, speech -> semantic tokens, then the rest of the same pipeline. Speaker embedding conditioning is applied to both text -> semantic tokens and semantic tokens -> Mel.
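To make that data flow concrete, here is a rough pseudo-pipeline under my reading of the post; every function name below is a hypothetical placeholder, not Chatterbox's actual API:

```python
# Hypothetical placeholders standing in for the Chatterbox components described in this post.
from typing import Any

def speaker_embed(audio: Any) -> Any: ...              # speaker embedding network(s)
def text_to_semantic(text: str, spk: Any) -> Any: ...  # llama-style sequence model
def audio_to_semantic(audio: Any) -> Any: ...          # S3-style semantic tokenizer
def semantic_to_mel(sem: Any, spk: Any) -> Any: ...    # flow-matching Mel generator
def mel_to_wav(mel: Any) -> Any: ...                   # HiFTNet-style vocoder

def tts(text: str, reference_audio: Any) -> Any:
    """text -> semantic tokens -> Mel -> waveform, conditioned on the reference speaker."""
    spk = speaker_embed(reference_audio)
    return mel_to_wav(semantic_to_mel(text_to_semantic(text, spk), spk))

def voice_conversion(source_audio: Any, reference_audio: Any) -> Any:
    """speech -> semantic tokens, then the same Mel and waveform stages."""
    spk = speaker_embed(reference_audio)
    return mel_to_wav(semantic_to_mel(audio_to_semantic(source_audio), spk))
```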

Given a reference audio, it is encoded by a speaker embedding network A (more on that in a second) and by a speech tokenizer. The speech tokenizer, S3Tokenizer, looks very much like CosyVoice's with some twists (its tokens are basically just very semantic because it is trained on an ASR objective).

These two embeddings are then sent into its core sequence-modeling model (a llama). CFG is applied by running the model twice within the same batch, once without the conditioning speaker embedding. The two runs are then averaged with a weight to form the final logits. The text tokens and speech tokens have separate absolute positional embeddings. (Why not RoPE, BTW?)
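A minimal sketch of that batched CFG step, assuming a model call with a hypothetical `speaker=` argument; the weighting scheme here is illustrative, not the exact one in the repo:

```python
import torch

def cfg_logits(model, tokens, speaker_emb, cfg_weight=0.5):
    """Run conditioned and unconditioned rows in one batch and blend their logits."""
    cond = speaker_emb.unsqueeze(0)                   # (1, D) speaker conditioning
    uncond = torch.zeros_like(cond)                   # null conditioning for the second row
    batch_tokens = tokens.unsqueeze(0).repeat(2, 1)   # (2, T) same token sequence twice
    batch_spk = torch.cat([cond, uncond], dim=0)      # (2, D)

    logits = model(batch_tokens, speaker=batch_spk)   # (2, T, V); hypothetical signature
    cond_logits, uncond_logits = logits[0], logits[1]

    # Weighted average of the two runs forms the final logits.
    return cfg_weight * cond_logits + (1.0 - cfg_weight) * uncond_logits
```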

A runtime quality control model inspects alignment by looking at the attention map of layer 9 of the sequence model during generation (this is genius actually, they actually went in and looked at attention maps!) and makes sure the generated tokens attend to the right words (the final token is not attended to for too long, previous tokens are not attended to again). It can't really fix anything, but it can force an immediate end of speech by setting the <EOS> token's logit to ... 2^15...
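A hedged sketch of that runtime check; the attention-map shape, the stall/backtrack heuristics, and the token id are my assumptions, only the 2^15 override comes from the post:

```python
import torch

EOS_ID = 0             # placeholder token id
FORCE_LOGIT = 2.0**15  # value the post mentions being written into the <EOS> position

def alignment_guard(attn_map, logits, text_len, max_tail_steps=8):
    """attn_map: (generated_steps, text_len), e.g. layer-9 attention averaged over heads."""
    attended = attn_map.argmax(dim=-1)                  # text position each step attends to
    tail = attended[-max_tail_steps:]

    stuck_on_last = bool((tail == text_len - 1).all())  # lingering on the final word too long
    went_backwards = bool((attended.diff() < 0).any())  # re-attending to earlier words

    if stuck_on_last or went_backwards:
        logits[..., EOS_ID] = FORCE_LOGIT               # force generation to end immediately
    return logits
```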

So the speech tokens are generated by the sequence model, and a speaker embedding network B is used to extract the speaker embedding again and use it as the condition for the vocoder. Remember speaker embedding network A? That's GE2E (so basically 3 LSTMs). Network B is a CAM++ x-vector; if they didn't change it, then it is likely a model pre-trained with an AAM-Softmax objective. Why two embedding networks? I can't really figure this one out.

So the vocoder is a two-stage system: first generate a Mel spectrogram, then generate a waveform from the Mel. The Mel generation part is identical to CosyVoice, where a CFM model is used; they used HiFTNet instead of HiFiGAN as the waveform vocoder, though.

HiFTNet is a neural source-filter network with several stages. It starts by predicting F0 from the Mel, then uses these F0s to generate a source signal. The source signal then conditions an inverse-STFT network (the *neural filter*) that predicts frame-wise magnitude and phase.
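A minimal source-filter sketch in the spirit of that description; the hop size, sample rate, and sine-plus-noise source are assumptions, and in the real model the magnitude/phase come from learned predictors conditioned on the Mel and the source:

```python
import torch

def f0_to_source(f0_hz, hop=256, sr=24000):
    """Upsample frame-level F0 to sample rate and synthesize a sine source (NSF-style)."""
    f0 = f0_hz.repeat_interleave(hop)                      # (frames * hop,)
    phase = 2 * torch.pi * torch.cumsum(f0 / sr, dim=0)
    voiced = (f0 > 0).float()
    return voiced * torch.sin(phase) + 0.003 * torch.randn_like(phase)

def istft_filter(mag, phase, n_fft=1024, hop=256):
    """Render a waveform from predicted frame-wise magnitude and phase via inverse STFT."""
    spec = mag * torch.exp(1j * phase)                     # (n_fft // 2 + 1, frames) complex
    return torch.istft(spec, n_fft=n_fft, hop_length=hop,
                       window=torch.hann_window(n_fft))

# Wire the stages together with dummy tensors just to show the data flow.
frames = 50
f0 = torch.full((frames,), 120.0)        # a flat 120 Hz contour, all voiced
source = f0_to_source(f0)                # would condition the neural filter
mag = torch.rand(513, frames)
phase = torch.rand(513, frames) * 2 * torch.pi
audio = istft_filter(mag, phase)
```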

When the sequence modeling part is skipped, the speech tokens can also be extracted directly from the source speech and then passed through the decoder network.

The watermarking model is a separate model that runs at the waveform level. It adds a watermark into the magnitude spectrogram, then uses 3 branches of convolutions to model watermark presence at different time scales.
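A hedged sketch of what a three-branch, multi-scale presence detector over the magnitude spectrogram could look like; kernel sizes, channel counts, and the score averaging are my guesses, not the actual watermarker's architecture:

```python
import torch
import torch.nn as nn

class MultiScaleWatermarkDetector(nn.Module):
    """Three conv branches with different kernel sizes over the magnitude spectrogram,
    each scoring per-frame watermark presence at a different time scale."""
    def __init__(self, n_bins=513):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(n_bins, 64, kernel_size=k, padding=k // 2), nn.ReLU(),
                nn.Conv1d(64, 1, kernel_size=1),
            )
            for k in (3, 9, 27)          # short / medium / long temporal context
        ])

    def forward(self, mag):              # mag: (batch, n_bins, frames)
        scores = [torch.sigmoid(branch(mag)) for branch in self.branches]
        return torch.stack(scores, dim=0).mean(dim=0)    # average across the three scales

detector = MultiScaleWatermarkDetector()
print(detector(torch.rand(1, 513, 100)).shape)           # torch.Size([1, 1, 100])
```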
We speech guys rarely eat our own dogfood ;) A few interesting dictation tools are popular these days:

1) Willowvoice
2) Wisprflow
3) Superwhisper

It is interesting how LLM correction comes into play here: you don't get a plain transcript; instead, the tool converts your input into the required style (a rough sketch of the idea follows the links below).

https://willowvoice.com/

https://wisprflow.ai/

https://superwhisper.com/
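Here is the promised rough sketch of the LLM-correction step, assuming an OpenAI-compatible client; the model name and prompt are illustrative, not what any of these tools actually use:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def style_transcript(raw_transcript: str,
                     style: str = "a concise, well-punctuated email") -> str:
    """Rewrite a raw dictation transcript into the style the user asked for."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": f"Rewrite the dictated text as {style}. "
                        "Remove disfluencies, fix punctuation, keep the meaning."},
            {"role": "user", "content": raw_transcript},
        ],
    )
    return response.choices[0].message.content

print(style_transcript("um so can you uh send me the report by friday thanks"))
```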
As part of our mission to create open-source datasets for low-resource African languages, Digital Umuganda has released 2,250 hours of open-source Kinyarwanda speech data. To accompany this release, we launched an ASR hackathon on Kaggle, inviting the ecosystem to build models and contribute to shaping the future of low resource language technologies.

Our goal is to collect 10,000 hours each of Kinyarwanda and Swahili speech data. This hackathon is a crucial step in that journey. The feedback will help us refine our data collection strategy for the remaining hours and ensure the datasets meet the needs of developers, researchers, and language advocates across the region.

We would greatly appreciate it if you could share this initiative with your network and help us reach more contributors passionate about language, technology, and open data.

The hackathon is made up of 3 tracks:

Track A – Small: 540 hours of fully transcribed Kinyarwanda speech.

Track B – Medium: 1180 hours of fully transcribed Kinyarwanda speech.

Track C – Large: 1180 hours of transcribed speech plus 1170 hours of unlabeled Kinyarwanda audio.

For more information you can check the Hackathon website https://digital-umuganda.github.io/kasr_hackathon/
https://www.linkedin.com/posts/alexander-polok-b5567284_dicow-diarization-conditioned-whisper-for-activity-7341058825732415488-UVez

We are happy to announce that our DiCoW and DiariZen based system finished 🥈 in the Challenge and Workshop on Multilingual Conversational Speech Language Model (MLC-SLM) at Interspeech 2025, organized by Nexdata.jp (formerly Datatang Co., Ltd.)!

https://www.nexdata.ai/competition/mlc-slm

📄 System description and additional analysis (including dataset inconsistencies) are now available on arXiv:
BUT System for the MLC-SLM Challenge:
👉 https://www.arxiv.org/abs/2506.13414

In addition, I’m very pleased to share that our journal paper:
"DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition"
👉https://www.sciencedirect.com/science/article/abs/pii/S088523082500066X
has been accepted for publication in Computer Speech & Language (Elsevier)!

And last but not least — just yesterday I had the pleasure of presenting a tutorial and mini-challenge on fine-tuning DiCoW in data/compute constrained environments at this year's JSALT Summer School!
🎓 If you want to try it yourself: https://colab.research.google.com/github/Lakoc/JSALT_tutorial/blob/main/challenge.ipynb

https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard

🎥 Recording available here: https://www.youtube.com/watch?v=KqNKGjcsi9g&list=PLSeS0sl8xpTwz7h5iJSniiF89iUdZXNJ2&index=28
LAION proudly presents 2 state-of-the-art emotion detection models for voice and face, surpassing Gemini 2.5 Pro and Hume API. They are completely open under a CC BY 4.0 license, alongside a ~5,000-hour voice-acting dataset & 2 expert-annotated benchmarks.

https://huggingface.co/laion/BUD-E-Whisper

https://arxiv.org/abs/2505.20033
https://github.com/fluxions-ai/vui
https://huggingface.co/fluxions/vui

Got some attention recently. A multispeaker TTS model with context (like Dia), 100M params.

Dia vs vui

- vui is 16x smaller
- Unlimited render length; Dia is limited to 30 seconds
- vui has 150 ms latency (time to first byte)
- vui runs in <5 GB VRAM
- 4x faster codec
- 1/2 the number of people
- Built with Google Cloud TPUs vs two 4090s in a basement
- 7x faster RTF
We track Inworld's company status as it was founded by the Dialogflow guys (Dialogflow was very popular back in the day). Interesting that AI for games didn't work out.

https://www.linkedin.com/posts/kylangibbs_inworld-is-evolving-1-we-just-published-activity-7341215644828188672-aYWf

Inworld is evolving.

1. We just published our vision of the future. These are distilled learnings based on our first 4 years engaged with partners like NVIDIA, Microsoft, Status, Little Umbrella, Streamlabs, Nanobit, NBCUniversal, Mistral AI, Google and thousands of other developers.

Due to explosive growth in demand we are widening our focus to broader consumer applications (extending from games into new areas like fitness, learning and social connection). We are seeing new and existing companies across consumer categories shift the focus of AI adoption from cost savings to net new revenue opportunities through novel AI-native applications, and we are leaning in to support that shift.
Tested the https://huggingface.co/kyutai/stt-1b-en_fr model on some diverse data. Accuracy is on the lower side.

CMU Kids WER is 11.3, for example, compared to 4.8 for parakeet-tdt-0.6b-v2. LibriSpeech test-clean WER is 4+ too.

The output is sometimes Chinese, sometimes Arabic.
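For reference, a minimal sketch of how such WER numbers can be reproduced with jiwer, assuming a simple lowercase/punctuation normalization (not necessarily the exact setup behind the figures above):

```python
import jiwer

def normalize(text: str) -> str:
    """Lowercase and strip punctuation before scoring."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).strip()

references = ["The quick brown fox jumps over the lazy dog."]   # ground-truth transcripts
hypotheses = ["the quick brown fox jumped over a lazy dog"]     # model outputs

wer = jiwer.wer([normalize(r) for r in references],
                [normalize(h) for h in hypotheses])
print(f"WER: {wer:.3f}")
```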