This repository contains an implementation of Soundstream, an audio codec designed for speech tasks, with some slight modifications to improve performance. The model training code is written with PyTorch Lightning and supports multi-GPU training.
- Model Code: The core Soundstream model architecture is implemented.
- Training Script: Supports multi-GPU training through PyTorch Lightning (see the training-launch sketch after this list).
- Hugging Face Dataset Support: Uses the GPT OMNI Audio Dataset, available here, for training on speech tasks.
- Argparse Support: Adding `argparse` for finer control over training settings, including the ability to choose custom datasets (see the argparse sketch after this list).
- More Discriminators: Plans to implement MultiScale and MultiFrequency discriminators to enhance model performance.
- Transformer Blocks: Adding transformer layers to the encoder and decoder modules, similar to the Mimi Codec architecture (a rough placement sketch follows this list).
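For reference, here is a minimal sketch of what a multi-GPU training launch with PyTorch Lightning looks like. `SoundStreamModule` and `AudioDataModule` are placeholder names for illustration, not this repo's actual classes:

```python
# Minimal multi-GPU launch sketch with PyTorch Lightning.
# SoundStreamModule / AudioDataModule are hypothetical names standing in
# for this repo's LightningModule and data module.
import pytorch_lightning as pl

from model import SoundStreamModule   # hypothetical import path
from data import AudioDataModule      # hypothetical import path


def main():
    model = SoundStreamModule()
    datamodule = AudioDataModule(batch_size=16)

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=-1,        # use all visible GPUs
        strategy="ddp",    # DistributedDataParallel across devices
        max_epochs=100,
    )
    trainer.fit(model, datamodule=datamodule)


if __name__ == "__main__":
    main()
```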
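A rough sketch of the planned `argparse` interface follows; the flag names and defaults are illustrative assumptions, not a final CLI:

```python
# Sketch of the planned argparse interface; flags are illustrative only.
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Train the Soundstream codec")
    parser.add_argument("--dataset", type=str, default=None,
                        help="Hugging Face dataset ID or local path")
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--max-epochs", type=int, default=100)
    parser.add_argument("--devices", type=int, default=-1,
                        help="Number of GPUs to use (-1 for all available)")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(args)
```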
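And a rough idea of where a transformer block could sit in the encoder, running self-attention over the downsampled latent sequence as Mimi does. This placement is an assumption for illustration, not the final design:

```python
# Sketch: wrap the existing convolutional encoder and run a transformer
# over its output frames. Placement and hyperparameters are assumptions.
import torch
import torch.nn as nn


class EncoderWithTransformer(nn.Module):
    def __init__(self, conv_encoder: nn.Module, dim: int = 512,
                 num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        self.conv_encoder = conv_encoder  # existing Soundstream conv encoder
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # conv_encoder: (batch, 1, samples) -> (batch, dim, frames)
        latents = self.conv_encoder(audio)
        # transformer expects (batch, frames, dim) with batch_first=True
        latents = self.transformer(latents.transpose(1, 2))
        return latents.transpose(1, 2)
```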
This Soundstream codec is intended to be the foundation for building a text-to-speech (TTS) diffusion model. The idea is to use the continuous codes generated by the Soundstream encoder as a latent space for the diffusion model, which will eventually generate high-quality audio based on input text.
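A conceptual sketch of that pipeline, with `soundstream` and the diffusion model treated as placeholders (the diffusion training loop itself is out of scope here):

```python
# Conceptual sketch: the frozen Soundstream encoder provides the continuous
# latent space that a text-conditioned diffusion model operates on.
import torch


@torch.no_grad()
def audio_to_latents(soundstream, audio: torch.Tensor) -> torch.Tensor:
    """audio: (batch, 1, samples) -> continuous latents: (batch, dim, frames)."""
    soundstream.eval()
    return soundstream.encoder(audio)

# The diffusion model is trained to denoise these latents conditioned on text;
# generated latents are then passed through soundstream.decoder to get audio.
```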
In this context, we refer to semantic tokens as rich representations of the content or meaning of the audio (what is being said), while acoustic tokens carry detailed information about the sound, including pitch, tone, and other auditory features. Semantic tokens retain the meaning, whereas acoustic tokens preserve all aspects of the sound, from timbre to inflection and more.
The Mimi paper mentions distilling information by comparing encoder outputs to representations from a self-supervised audio language model like WavLM using cosine similarity. This process helps retain semantic information in the latent space, ensuring that the original meaning of the speech is preserved.
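As a rough sketch of that idea, the distillation term can be written as a per-frame cosine loss between projected codec latents and the self-supervised features. The projection layer and the assumption that both feature streams are already aligned to the same frame rate are illustrative choices, not the Mimi paper's exact recipe:

```python
# Hedged sketch of a semantic distillation loss: align projected codec
# latents with SSL features (e.g. WavLM hidden states) via cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticDistillationLoss(nn.Module):
    def __init__(self, codec_dim: int, ssl_dim: int):
        super().__init__()
        # project codec latents into the SSL feature space
        self.proj = nn.Linear(codec_dim, ssl_dim)

    def forward(self, codec_latents: torch.Tensor,
                ssl_features: torch.Tensor) -> torch.Tensor:
        # codec_latents: (batch, frames, codec_dim)
        # ssl_features:  (batch, frames, ssl_dim), assumed resampled to the
        #                codec frame rate
        pred = self.proj(codec_latents)
        # maximize per-frame cosine similarity -> minimize (1 - cos)
        cos = F.cosine_similarity(pred, ssl_features, dim=-1)
        return (1.0 - cos).mean()
```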
However, for my specific use case, I’m questioning whether this distillation step is necessary. My goal is to design the diffusion process similarly to how you would approach image inpainting, focusing on generating high-quality speech outputs without losing acoustic richness.
Given that I am focusing on creating a high-fidelity diffusion model, is distillation necessary to retain semantic information, or is there a way to enhance acoustic information instead? If semantic retention isn’t crucial for my use case, I would prefer methods that boost the quality and richness of the acoustic information, ensuring that the resulting speech sounds natural, with all fine auditory details intact.
- Mimi Paper
- NaturalSpeech
- SEANet
- SoundStream
- This awesome repo, which I lifted some code from.
I really enjoy talking with people about machine learning for audio. If you would like to contribute to this project, or just share feedback, I would love to hear from you!