Implementation of Soundstream, An Audio Codec for Speech Tasks

This repository contains an implementation of Soundstream, an audio codec designed for speech tasks, with slight modifications to improve performance. The training code is written with PyTorch Lightning and supports multi-GPU setups for faster, more efficient training.

Current Features

  1. Model Code: The core Soundstream model architecture is implemented.
  2. Training Script: Supports multi-GPU training through PyTorch Lightning (see the sketch after this list).
  3. Hugging Face Dataset Support: Uses the GPT OMNI Audio Dataset from the Hugging Face Hub for training on speech tasks.
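
As referenced in the training script item above, a minimal sketch of the multi-GPU Trainer setup might look like the following. The class names `SoundStreamModule` and `AudioDataModule` are hypothetical placeholders, not the actual modules in this repo.

```python
# Minimal sketch of a multi-GPU Lightning training entry point.
# `SoundStreamModule` and `AudioDataModule` are hypothetical placeholders
# standing in for the LightningModule and dataset wrapper defined in this repo.
import lightning.pytorch as pl

from model import SoundStreamModule   # hypothetical import path
from data import AudioDataModule      # hypothetical import path


def main():
    model = SoundStreamModule()
    datamodule = AudioDataModule(batch_size=16)

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=-1,               # use every visible GPU
        strategy="ddp",           # distributed data parallel across GPUs
        precision="16-mixed",     # mixed precision is a common choice for codec training
        max_steps=200_000,
    )
    trainer.fit(model, datamodule=datamodule)


if __name__ == "__main__":
    main()
```

Launched with a plain `python train.py`, Lightning spawns one DDP process per GPU and handles gradient synchronisation and checkpointing.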

Coming Soon

  1. Argparse Support: Adding argparse for finer control over training settings, including the ability to choose custom datasets (an illustrative sketch follows this list).
  2. More Discriminators: Plans to implement MultiScale and MultiFrequency discriminators to enhance model performance.
  3. Transformer Blocks: Adding transformer layers to the encoder and decoder modules, similar to the Mimi Codec architecture.
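
For the planned argparse support, a possible interface is sketched below. All flag names and defaults are illustrative assumptions, not the project's final CLI.

```python
# Illustrative sketch of a possible argparse interface for the training script.
# Flag names and defaults are assumptions, not the project's final CLI.
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Train the Soundstream codec")
    parser.add_argument("--dataset", type=str, default=None,
                        help="Hugging Face dataset name or local path (placeholder)")
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--devices", type=int, default=-1,
                        help="number of GPUs to use (-1 = all available)")
    parser.add_argument("--max-steps", type=int, default=200_000)
    parser.add_argument("--learning-rate", type=float, default=3e-4)
    return parser.parse_args()


if __name__ == "__main__":
    print(parse_args())
```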

Ideas

This Soundstream codec is intended to be the foundation for building a text-to-speech (TTS) diffusion model. The idea is to use the continuous codes generated by the Soundstream encoder as a latent space for the diffusion model, which will eventually generate high-quality audio based on input text.
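As a very rough illustration of that idea, the toy sketch below uses a single Conv1d as a stand-in for the real Soundstream encoder (the 320x downsampling factor matches the original paper's 24 kHz configuration); the continuous latent frames it produces are the space the diffusion model would operate in.

```python
# Toy illustration of the latent space: a placeholder "encoder" maps a 24 kHz
# waveform to continuous latent frames, which a text-conditioned diffusion model
# would learn to generate before the Soundstream decoder turns them back into audio.
# The Conv1d here is only a stand-in, not the actual Soundstream encoder.
import torch
import torch.nn as nn

toy_encoder = nn.Conv1d(1, 128, kernel_size=320, stride=320)  # 320x downsampling stand-in

waveform = torch.randn(1, 1, 24_000)      # (batch, channels, samples), ~1 second at 24 kHz
with torch.no_grad():
    latents = toy_encoder(waveform)       # (batch, latent_dim, frames), continuous codes

print(latents.shape)                      # torch.Size([1, 128, 75])
```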

Semantic and Acoustic Tokens

In this context, we refer to semantic tokens as rich representations of the content or meaning of the audio (what is being said), while acoustic tokens carry detailed information about the sound itself: pitch, timbre, inflection, and other fine auditory features. Semantic tokens retain the meaning, whereas acoustic tokens preserve how it sounds.

Distillation from WavLM

The Mimi paper mentions distilling information by comparing encoder outputs to representations from a self-supervised audio language model like WavLM using cosine similarity. This process helps retain semantic information in the latent space, ensuring that the original meaning of the speech is preserved.
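For illustration, a minimal sketch of such a cosine-similarity distillation term might look like the following. The projection layer, the dimensions, and the assumption that both feature streams are already aligned to the same frame rate are mine, not details taken from the paper.

```python
# Sketch of a cosine-similarity distillation loss between codec encoder outputs
# and frame-level features from a self-supervised model such as WavLM.
import torch
import torch.nn.functional as F


def distillation_loss(encoder_out: torch.Tensor,
                      teacher_feats: torch.Tensor,
                      proj: torch.nn.Linear) -> torch.Tensor:
    """encoder_out: (batch, frames, codec_dim); teacher_feats: (batch, frames, teacher_dim).
    Both are assumed to be aligned to the same frame rate beforehand."""
    projected = proj(encoder_out)                                 # map codec dim -> teacher dim
    cos = F.cosine_similarity(projected, teacher_feats, dim=-1)   # (batch, frames)
    return (1.0 - cos).mean()                                     # push similarity toward 1


# Example with dummy tensors and assumed dimensions (codec_dim=128, teacher_dim=768):
proj = torch.nn.Linear(128, 768)
enc = torch.randn(2, 50, 128)
teacher = torch.randn(2, 50, 768)
print(distillation_loss(enc, teacher, proj))
```

In practice such a term would typically be added to the reconstruction and adversarial losses with its own weight.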

However, for my specific use case, I’m questioning whether this distillation step is necessary. My goal is to design the diffusion process similarly to how you would approach image inpainting, focusing on generating high-quality speech outputs without losing acoustic richness.

Key Question: Is Distillation Necessary?

Given that I am focusing on creating a high-fidelity diffusion model, is distillation necessary to retain semantic information, or is there a way to enhance acoustic information instead? If semantic retention isn’t crucial for my use case, I would prefer methods that boost the quality and richness of the acoustic information, ensuring that the resulting speech sounds natural, with all fine auditory details intact.

References

  1. Mimi Paper
  2. NaturalSpeech
  3. Seanet
  4. Soundstream
  5. This awesome repo, from which some of the code here was adapted.

Contributions welcome

I really enjoy talking with people about machine learning for audio. If you would like to contribute to this project, or just share feedback, I would love to hear from you!
