This repository contains an implementation of Soundstream, an audio codec designed for speech tasks, with some slight modifications to improve performance. The model training code is written with PyTorch Lightning and supports multi-GPU training.
- Model Code: The core Soundstream model architecture is implemented.
- Training Script: Supports multi-GPU training through PyTorch Lightning (see the training-launch sketch after this list).
- Hugging Face Dataset Support: Uses the GPT OMNI Audio Dataset, available here, for training on speech tasks.
- Argparse Support: Adding `argparse` for finer control over training settings, including the ability to choose custom datasets (see the argparse sketch after this list).
- More Discriminators: Plans to implement MultiScale and MultiFrequency discriminators to enhance model performance.
- Transformer Blocks: Adding transformer layers to the encoder and decoder modules, similar to the Mimi Codec architecture (a rough placement sketch follows this list).
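For reference, here is a minimal sketch of what a multi-GPU training launch with PyTorch Lightning looks like. `SoundStreamModule` and `AudioDataModule` are placeholder names for illustration, not this repo's actual classes:

```python
# Minimal multi-GPU launch sketch with PyTorch Lightning.
# SoundStreamModule / AudioDataModule are hypothetical names standing in
# for this repo's LightningModule and data module.
import pytorch_lightning as pl

from model import SoundStreamModule   # hypothetical import path
from data import AudioDataModule      # hypothetical import path


def main():
    model = SoundStreamModule()
    datamodule = AudioDataModule(batch_size=16)

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=-1,        # use all visible GPUs
        strategy="ddp",    # DistributedDataParallel across devices
        max_epochs=100,
    )
    trainer.fit(model, datamodule=datamodule)


if __name__ == "__main__":
    main()
```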
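A rough sketch of the planned `argparse` interface follows; the flag names and defaults are illustrative assumptions, not a final CLI:

```python
# Sketch of the planned argparse interface; flags are illustrative only.
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Train the Soundstream codec")
    parser.add_argument("--dataset", type=str, default=None,
                        help="Hugging Face dataset ID or local path")
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--max-epochs", type=int, default=100)
    parser.add_argument("--devices", type=int, default=-1,
                        help="Number of GPUs to use (-1 for all available)")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(args)
```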
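And a rough idea of where a transformer block could sit in the encoder, running self-attention over the downsampled latent sequence as Mimi does. This placement is an assumption for illustration, not the final design:

```python
# Sketch: wrap the existing convolutional encoder and run a transformer
# over its output frames. Placement and hyperparameters are assumptions.
import torch
import torch.nn as nn


class EncoderWithTransformer(nn.Module):
    def __init__(self, conv_encoder: nn.Module, dim: int = 512,
                 num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        self.conv_encoder = conv_encoder  # existing Soundstream conv encoder
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # conv_encoder: (batch, 1, samples) -> (batch, dim, frames)
        latents = self.conv_encoder(audio)
        # transformer expects (batch, frames, dim) with batch_first=True
        latents = self.transformer(latents.transpose(1, 2))
        return latents.transpose(1, 2)
```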
This Soundstream codec is intended to be the foundation for building a text-to-speech (TTS) diffusion model. The idea is to use the continuous codes generated by the Soundstream encoder as a latent space for the diffusion model, which will eventually generate high-quality audio based on input text.
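A conceptual sketch of that pipeline, with `soundstream` and the diffusion model treated as placeholders (the diffusion training loop itself is out of scope here):

```python
# Conceptual sketch: the frozen Soundstream encoder provides the continuous
# latent space that a text-conditioned diffusion model operates on.
import torch


@torch.no_grad()
def audio_to_latents(soundstream, audio: torch.Tensor) -> torch.Tensor:
    """audio: (batch, 1, samples) -> continuous latents: (batch, dim, frames)."""
    soundstream.eval()
    return soundstream.encoder(audio)

# The diffusion model is trained to denoise these latents conditioned on text;
# generated latents are then passed through soundstream.decoder to get audio.
```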
In this context, we refer to semantic tokens as rich representations of the content or meaning of the audio (what is being said), while acoustic tokens carry detailed information about the sound, including pitch, tone, and other auditory features. Semantic tokens retain the meaning, whereas acoustic tokens preserve all aspects of the sound, from timbre to inflection and more.
The Mimi paper mentions distilling information by comparing encoder outputs to representations from a self-supervised audio language model like WavLM using cosine similarity. This process helps retain semantic information in the latent space, ensuring that the original meaning of the speech is preserved.
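As a rough sketch of that idea, the distillation term can be written as a per-frame cosine loss between projected codec latents and the self-supervised features. The projection layer and the assumption that both feature streams are already aligned to the same frame rate are illustrative choices, not the Mimi paper's exact recipe:

```python
# Hedged sketch of a semantic distillation loss: align projected codec
# latents with SSL features (e.g. WavLM hidden states) via cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticDistillationLoss(nn.Module):
    def __init__(self, codec_dim: int, ssl_dim: int):
        super().__init__()
        # project codec latents into the SSL feature space
        self.proj = nn.Linear(codec_dim, ssl_dim)

    def forward(self, codec_latents: torch.Tensor,
                ssl_features: torch.Tensor) -> torch.Tensor:
        # codec_latents: (batch, frames, codec_dim)
        # ssl_features:  (batch, frames, ssl_dim), assumed resampled to the
        #                codec frame rate
        pred = self.proj(codec_latents)
        # maximize per-frame cosine similarity -> minimize (1 - cos)
        cos = F.cosine_similarity(pred, ssl_features, dim=-1)
        return (1.0 - cos).mean()
```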
However, for my specific use case, I’m questioning whether this distillation step is necessary. My goal is to design the diffusion process similarly to how you would approach image inpainting, focusing on generating high-quality speech outputs without losing acoustic richness.
Given that I am focusing on creating a high-fidelity diffusion model, is distillation necessary to retain semantic information, or is there a way to enhance acoustic information instead? If semantic retention isn’t crucial for my use case, I would prefer methods that boost the quality and richness of the acoustic information, ensuring that the resulting speech sounds natural, with all fine auditory details intact.
- Mimi Paper
- NaturalSpeech
- SEANet
- SoundStream
- This awesome repo, which I lifted some code from.
I really enjoy talking with people about machine learning for audio. If you would like to contribute to this project, or just share feedback, I would love to hear from you!