For the crypto guys: projects to fine-tune different TTS models
https://github.com/impel-intelligence/dippy-speech-subnet
https://github.com/myshell-ai/MyShell-TTS-Subnet
https://x.com/LiuXub/status/1863622470709690575
TAAE — the first Transformer-based Audio AutoEncoder scaled to 1B parameters for neural speech coding! 🔥
TAAE achieves state-of-the-art speech quality at ultra-low bitrates of 400 or 700 bits-per-second, delivering reconstruction quality remarkably close to real audio. It sets a new benchmark for efficient and high-quality speech tokenization.
📖 Paper: https://arxiv.org/abs/2411.19842v1
👂 Demos: https://stability-ai.github.io/stable-codec-demo/
💻 GitHub: https://github.com/Stability-AI/stable-codec
Code and pre-trained models will be released to empower the community!
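For intuition on what 400 or 700 bits per second means for a token-based codec: the bitrate is just the token rate times the bits per token. A back-of-the-envelope sketch with made-up numbers, not TAAE's actual configuration:

```python
import math

def codec_bitrate(frames_per_second: float, codebooks: int, codebook_size: int) -> float:
    """Bitrate of a discrete codec: frames/s * codebooks per frame * bits per code."""
    return frames_per_second * codebooks * math.log2(codebook_size)

# Illustrative only: a single 25 Hz token stream with a 65536-entry codebook
# works out to 400 bits per second.
print(codec_bitrate(frames_per_second=25, codebooks=1, codebook_size=65536))  # 400.0
```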
Indic Parler-TTS is a multilingual Indic extension of Parler-TTS Mini.
It is a fine-tuned version of Indic Parler-TTS Pretrained, trained on a 1,806-hour multilingual Indic and English dataset.
Indic Parler-TTS Mini officially supports 20 Indic languages plus English, making it comprehensive for regional language technologies. The 21 supported languages are: Assamese, Bengali, Bodo, Dogri, English, Gujarati, Hindi, Kannada, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Sanskrit, Santali, Sindhi, Tamil, Telugu, and Urdu.
Thanks to its better prompt tokenizer, it can easily be extended to other languages. This tokenizer has a larger vocabulary and handles byte fallback, which simplifies multilingual training.
https://huggingface.co/ai4bharat/indic-parler-tts
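A minimal usage sketch following the standard Parler-TTS pattern; the separate description tokenizer is how the model card handles voice descriptions, but treat the exact argument names as assumptions and verify them against the card:

```python
import torch
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("ai4bharat/indic-parler-tts").to(device)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-parler-tts")
# The voice description is tokenized with the text encoder's own tokenizer.
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

prompt = "अरे, तुम आज कैसे हो?"  # Hindi text to synthesize
description = "A female speaker delivers slightly expressive speech with clear audio quality."

input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

audio = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
sf.write("out.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)
```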
Introducing Fish Speech 1.5 🎉 - Making state-of-the-art TTS accessible to everyone!
Highlights:
- #2 ranked on TTS-Arena (as "Anonymous Sparkle")
- 1M hours of multilingual training data
- 13 languages supported, including English, Chinese, Japanese & more
- <150ms latency with high-quality instant voice cloning
- Pretrained model now open source
- Cost-effective self-hosting or cloud options
Let's check out the details 🧵⬇️
https://x.com/FishAudio/status/1864370933496205728
Supported languages:
English (en) >300k hours
Chinese (zh) >300k hours
Japanese (ja) >100k hours
German (de) ~20k hours
French (fr) ~20k hours
Spanish (es) ~20k hours
Korean (ko) ~20k hours
Arabic (ar) ~20k hours
Russian (ru) ~20k hours
Dutch (nl) <10k hours
Italian (it) <10k hours
Polish (pl) <10k hours
Portuguese (pt) <10k hours
While widely used, discrete methods have disadvantages (there are advantages too), and there are attempts to replace them with continuous models. This paper has been getting quite a bit of attention:
https://x.com/marco_ppasini/status/1864330701530644835
https://arxiv.org/abs/2411.18447
Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation
Marco Pasini, Javier Nistal, Stefan Lattner, George Fazekas
Autoregressive models are typically applied to sequences of discrete tokens, but recent research indicates that generating sequences of continuous embeddings in an autoregressive manner is also feasible. However, such Continuous Autoregressive Models (CAMs) can suffer from a decline in generation quality over extended sequences due to error accumulation during inference. We introduce a novel method to address this issue by injecting random noise into the input embeddings during training. This procedure makes the model robust against varying error levels at inference. We further reduce error accumulation through an inference procedure that introduces low-level noise. Experiments on musical audio generation show that CAM substantially outperforms existing autoregressive and non-autoregressive approaches while preserving audio quality over extended sequences. This work paves the way for generating continuous embeddings in a purely autoregressive setting, opening new possibilities for real-time and interactive generative applications.
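The core trick is simple to sketch: during training, the continuous embeddings fed back as context are corrupted with a random amount of Gaussian noise, so the model learns to tolerate the imperfect inputs it will see when consuming its own outputs. A schematic illustration of the idea, not the authors' code:

```python
import torch

def noise_augment(prev_embeddings: torch.Tensor, max_sigma: float = 0.5) -> torch.Tensor:
    """Corrupt teacher-forced continuous embeddings with randomly scaled Gaussian noise.

    prev_embeddings: (batch, seq_len, dim) context for the autoregressive model.
    Sampling a different noise level per example exposes the model to varying
    error levels, which is what makes it robust to error accumulation at inference.
    """
    sigma = torch.rand(prev_embeddings.size(0), 1, 1, device=prev_embeddings.device) * max_sigma
    return prev_embeddings + sigma * torch.randn_like(prev_embeddings)

# Training step (schematic): predict the next embedding from noised context.
# pred = model(noise_augment(context)); loss = mse(pred, target_embedding)
```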
An example of an issue copied from repo to repo:
https://github.com/jaywalnut310/vits/issues/11
In VITS we predict a float duration and then convert it to attention steps, so the floats have to be rounded. VITS applies ceil, which makes the output longer than the original (hence the usual length scale of about 0.9). As a result, you need to scale back to match the original length:
https://github.com/jaywalnut310/vits/blob/main/models.py#L511
In GlowTTS there is an extra clamp:
https://github.com/coqui-ai/TTS/blob/main/TTS/tts/models/glow_tts.py#L351
This code is copied from repo to repo; a fun thing happens in Matcha, where we multiply by the length factor after ceil has already been applied:
https://github.com/shivammehta25/Matcha-TTS/blob/main/matcha/models/matcha_tts.py#L122
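The effect is easy to reproduce: ceil on predicted float durations systematically lengthens the utterance, and where the compensating length scale is applied matters. A toy illustration (not code from any of the repos above):

```python
import torch

durations = torch.tensor([3.2, 4.7, 2.1, 5.9])  # predicted float durations, in frames

# ceil always rounds up, so the total length grows relative to the prediction.
print(durations.sum())               # 15.9
print(torch.ceil(durations).sum())   # 18.0  (VITS-style)
print(torch.round(durations).sum())  # 16.0

# Compensating with a length scale: applying it before ceil keeps integer frame
# counts, while applying it after ceil (the Matcha code path mentioned above)
# rescales already-rounded integers.
length_scale = 0.9
print(torch.ceil(durations * length_scale).sum())    # 16.0
print((torch.ceil(durations) * length_scale).sum())  # 16.2 -- no longer whole frames
```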
Very good ideas here: training on dirty data, joint ASR/TTS, and so on.
https://arxiv.org/abs/2412.08237
TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch
Xingchen Song, Mengtao Xing, Changwei Ma, Shengqiang Li, Di Wu, Binbin Zhang, Fuping Pan, Dinghao Zhou, Yuekai Zhang, Shun Lei, Zhendong Peng, Zhiyong Wu
It is well known that LLM-based systems are data-hungry. Recent LLM-based TTS works typically employ complex data processing pipelines to obtain high-quality training data. These sophisticated pipelines require excellent models at each stage (e.g., speech denoising, speech enhancement, speaker diarization, and punctuation models), which themselves demand high-quality training data and are rarely open-sourced. Even with state-of-the-art models, issues persist, such as incomplete background noise removal and misalignment between punctuation and actual speech pauses. Moreover, the stringent filtering strategies often retain only 10-30% of the original data, significantly impeding data scaling efforts. In this work, we leverage a noise-robust audio tokenizer (S3Tokenizer) to design a simplified yet effective TTS data processing pipeline that maintains data quality while substantially reducing data acquisition costs, achieving a data retention rate of over 50%. Beyond data scaling challenges, LLM-based TTS systems also incur higher deployment costs compared to conventional approaches. Current systems typically use LLMs solely for text-to-token generation, while requiring separate models (e.g., flow matching models) for token-to-waveform generation, which cannot be directly executed by LLM inference engines, further complicating deployment. To address these challenges, we eliminate redundant modules in both LLM and flow components, replacing the flow model backbone with an LLM architecture. Building upon this simplified flow backbone, we propose a unified architecture for both streaming and non-streaming inference, significantly reducing deployment costs. Finally, we explore the feasibility of unifying TTS and ASR tasks using the same data for training, thanks to the simplified pipeline and the S3Tokenizer that reduces the quality requirements for TTS training data.
The Codec-SUPERB@SLT 2024 talks on neural audio codecs and speech language models are up on YouTube:
https://www.youtube.com/playlist?list=PLJV_el3uVTsNnC37JYD8kBcNDI7CNJgum
CosyVoice 2 release
https://funaudiollm.github.io/cosyvoice2/
https://arxiv.org/abs/2412.10117
In our previous work, we introduced CosyVoice, a multilingual speech synthesis model based on supervised discrete speech tokens. By employing progressive semantic decoding with two popular generative models, language models (LMs) and Flow Matching, CosyVoice demonstrated high prosody naturalness, content consistency, and speaker similarity in speech in-context learning. Recently, significant progress has been made in multi-modal large language models (LLMs), where the response latency and real-time factor of speech synthesis play a crucial role in the interactive experience. Therefore, in this report, we present an improved streaming speech synthesis model, CosyVoice 2, which incorporates comprehensive and systematic optimizations. Specifically, we introduce finite-scalar quantization to improve the codebook utilization of speech tokens. For the text-speech LM, we streamline the model architecture to allow direct use of a pre-trained LLM as the backbone. In addition, we develop a chunk-aware causal flow matching model to support various synthesis scenarios, enabling both streaming and non-streaming synthesis within a single model. By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode. We invite readers to listen to the demos at https://funaudiollm.github.io/cosyvoice2.
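Finite scalar quantization, the codec change mentioned above, replaces a learned VQ codebook with per-dimension rounding to a few fixed levels, so the implicit codebook is fully used by construction. A schematic sketch of FSQ in general, not the CosyVoice 2 implementation:

```python
import torch

def fsq(z: torch.Tensor, levels: int = 5) -> torch.Tensor:
    """Finite scalar quantization: round each latent dimension to `levels` fixed values.

    z: (..., dim) latents. Each dimension is squashed to a bounded range with tanh
    and rounded to the nearest of `levels` evenly spaced points, giving an implicit
    codebook of levels**dim entries with no codebook to learn (and hence no
    codebook-collapse / utilization problem).
    """
    half = (levels - 1) / 2
    z_bounded = torch.tanh(z) * half
    z_quant = torch.round(z_bounded)
    # Straight-through estimator: quantized values forward, identity gradient backward.
    return z_bounded + (z_quant - z_bounded).detach()
```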
Speech talks from Mila
https://poonehmousavi.github.io/rg
https://www.youtube.com/@CONVAI_RG
A recent one: Discrete Audio Tokens for Multimodal LLMs by Mirco Ravanelli
https://www.youtube.com/watch?v=2-Dqzg3fuVE
Upcoming ones are also interesting
A big ASR release from the WeNet team
TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch
Xingchen Song, Chengdong Liang, Binbin Zhang, Pengshen Zhang, ZiYu Wang, Youcheng Ma, Menglong Xu, Lin Wang, Di Wu, Fuping Pan, Dinghao Zhou, Zhendong Peng
Large Automatic Speech Recognition (ASR) models demand a vast number of parameters, copious amounts of data, and significant computational resources during the training process. However, such models can merely be deployed on high-compute cloud platforms and are only capable of performing speech recognition tasks. This leads to high costs and restricted capabilities. In this report, we initially propose the elastic mixture of the expert (eMoE) model. This model can be trained just once and then be elastically scaled in accordance with deployment requirements. Secondly, we devise an unsupervised data creation and validation procedure and gather millions of hours of audio data from diverse domains for training. Using these two techniques, our system achieves elastic deployment capabilities while reducing the Character Error Rate (CER) on the SpeechIO testsets from 4.98% to 2.45%. Thirdly, our model is not only competent in Mandarin speech recognition but also proficient in multilingual, multi-dialect, emotion, gender, and sound event perception. We refer to this as Automatic Speech Perception (ASP), and the perception results are presented in the experimental section.
A paper from respected people. By the way, testing on books (LibriSpeech and MLS) with Llama is usually a bad idea: Llama has already seen all the books many times.
https://arxiv.org/abs/2412.16464
Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition
Keqi Deng, Jinxi Guo, Yingyi Ma, Niko Moritz, Philip C. Woodland, Ozlem Kalinli, Mike Seltzer
While large language models (LLMs) have been applied to automatic speech recognition (ASR), the task of making the model streamable remains a challenge. This paper proposes a novel model architecture, Transducer-Llama, that integrates LLMs into a Factorized Transducer (FT) model, naturally enabling streaming capabilities. Furthermore, given that the large vocabulary of LLMs can cause data sparsity issue and increased training costs for spoken language systems, this paper introduces an efficient vocabulary adaptation technique to align LLMs with speech system vocabularies. The results show that directly optimizing the FT model with a strong pre-trained LLM-based predictor using the RNN-T loss yields some but limited improvements over a smaller pre-trained LM predictor. Therefore, this paper proposes a weak-to-strong LM swap strategy, using a weak LM predictor during RNN-T loss training and then replacing it with a strong LLM. After LM replacement, the minimum word error rate (MWER) loss is employed to finetune the integration of the LLM predictor with the Transducer-Llama model. Experiments on the LibriSpeech and large-scale multi-lingual LibriSpeech corpora show that the proposed streaming Transducer-Llama approach gave a 17% relative WER reduction (WERR) over a strong FT baseline and a 32% WERR over an RNN-T baseline.
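For reference, the reported gains are relative reductions; a small example of the arithmetic with made-up WER values:

```python
def relative_werr(baseline_wer: float, new_wer: float) -> float:
    """Relative word error rate reduction (WERR)."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical numbers for illustration only:
print(relative_werr(3.0, 2.49))  # ~0.17, i.e. a 17% relative reduction
```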
Maybe the notes are somewhat scattered, but I'd rather not use ChatGPT to fix them. Please check our recent experiments; I'd be happy to hear your comments.
https://alphacephei.com/nsh/2025/01/03/matcha-tts-notes.html
The dataset comprises a 5,000-hour speech corpus in Akan, Ewe, Dagbani, Daagare, and Ikposo. Each language includes 1,000 hours of audio from indigenous speakers of the language and 100 hours of transcription.
https://github.com/HCI-LAB-UGSPEECHDATA/speech_data_ghana_ug