
Semantic -> acoustic modeling #4

Closed
7 of 9 tasks
jpc opened this issue Feb 28, 2023 · 7 comments
Labels
goal Main sub-tasks of the project

Comments

@jpc
Contributor

jpc commented Feb 28, 2023

We got #3 working, so now it's time to try converting the Whisper-based semantic tokens (#3) to EnCodec-based acoustic tokens (#2).

We found out that better semantic tokens (from Whisper medium) make this task a lot easier, and even tiny models sound great. Multilingual semantic-token training helps, and cross-language voice cloning works great.

There are a couple of hypotheses to test:

  • Can we train a forward model, or does it have to be autoregressive to get anywhere? (no, it has to be autoregressive, but see SoundStorm)
  • To start simple, could we get away with single-speaker training only? This would allow us to ignore prompting for now and just let the model memorize the speaker. (seems to work on 1,000 hrs of one speaker)
  • How much data is needed to get bad performance (low-quality but intelligible speech)? (1,000 hours seems enough; training takes about a day on an A100)
  • And finally, last but not least: do the Whisper encoder embeddings retain enough phonetic information to do this at all? (from initial tests in Measuring the acoustic -> semantic -> text modeling difficulty #5 they seem to be closer to speech than to text)

We also still have a couple of engineering challenges:

  • fix the issue where the model starts generating noise after exactly 10 s (this may be related to cross-attention and the 3x length difference between the encoder and decoder contexts)
  • investigate sigmaReparam from Apple (supposed to make training more stable)
  • use the optimized scaled dot-product attention kernels from the newest PyTorch (should speed up training a lot)
  • add prompting and multi-speaker support (we currently condition on SpeechBrain speaker embeddings; see the sketch after this list)
  • switch to AdaFactor (should use less memory than Adam so we can train on smaller GPUs)
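
On the speaker-embedding point, here is a minimal sketch of pulling SpeechBrain embeddings for conditioning. It assumes the standard ECAPA-TDNN VoxCeleb checkpoint, which may differ from the exact model this project uses:

```python
import torch
from speechbrain.pretrained import EncoderClassifier

# Standard ECAPA-TDNN speaker-embedding checkpoint; an illustrative
# assumption, not necessarily the exact model this project conditions on.
classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained/spkrec-ecapa-voxceleb",
)

wav = torch.randn(1, 16000)             # stand-in for a 1 s, 16 kHz waveform
spk_emb = classifier.encode_batch(wav)  # -> (1, 1, 192) embedding tensor
```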
@jpc jpc modified the milestone: goal Feb 28, 2023
@jpc jpc added the goal Main sub-tasks of the project label Feb 28, 2023
@jpc
Contributor Author

jpc commented Mar 3, 2023

I pushed the first version of the semantic-to-acoustic modeling code, based on the Whisper transformer model, but it does not train, so I probably still have some bugs somewhere. I'm going to create a synthetic dataset and debug on it, the same way I debugged the quantization bottleneck (a sketch of the idea is below).
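
To illustrate the idea (a hypothetical sketch; the vocabulary sizes and mapping rule are made up): make the acoustic tokens a fixed deterministic function of the semantic tokens, so a correctly wired model should drive the loss to near zero almost immediately.

```python
import torch

def make_synthetic_pair(seq_len=150, n_semantic=512, n_acoustic=1024, upsample=3):
    """Toy S->A pair: each semantic token maps to `upsample` acoustic tokens
    via a fixed affine rule. If the model can't overfit this, the wiring
    (masking, shifting, embeddings) is broken, not the data."""
    semantic = torch.randint(0, n_semantic, (seq_len,))
    acoustic = (semantic.repeat_interleave(upsample) * 7 + 13) % n_acoustic
    return semantic, acoustic
```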

@jpc
Contributor Author

jpc commented Mar 14, 2023

I found some bugs in the code and now it trains successfully:

  1. Overfits quickly on 2 hrs of speech
  2. Trains without overfitting on my 160hr single-speaker dataset

The performance is still not great, but it's a step in the right direction. :) It's still based on the old VQ/RQ tokens, so switching to the improved ones should help a bit (see #3).

I also experimented with using Whisper embeddings directly (without quantization), and it works. That made it easy to experiment with extracting the embeddings from other layers of the encoder, which seems promising for balancing the difficulty of the two translation tasks: text -> semantic tokens vs. semantic tokens -> acoustic tokens. For reference, in SPEAR-TTS the semantic-to-acoustic task was a lot easier (they used a 12-layer decoder-only model, about the size of Whisper Base) than the text-to-semantic task (T5-Large, a 24-layer encoder plus a 24-layer decoder, the exact same size as Whisper Medium). A sketch of pulling intermediate-layer encoder embeddings is below.
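
A minimal sketch of that embedding extraction using a forward hook, assuming the openai-whisper package; the model size ("base"), the layer index, and the input file name are illustrative choices, not the ones used in the project:

```python
import torch
import whisper  # pip install openai-whisper

model = whisper.load_model("base", device="cpu")
captured = {}

def hook(module, inputs, output):
    captured["embs"] = output  # (batch, frames, dim)

# Grab activations from an intermediate encoder block instead of the last one.
layer_idx = 4  # illustrative choice
handle = model.encoder.blocks[layer_idx].register_forward_hook(hook)

audio = whisper.load_audio("sample.wav")  # hypothetical input file
mel = whisper.log_mel_spectrogram(whisper.pad_or_trim(audio)).unsqueeze(0)
with torch.no_grad():
    model.encoder(mel)
handle.remove()
```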

So right now we will focus on trying to understand the balance between these two tasks.

@jpc
Contributor Author

jpc commented Mar 29, 2023

I've trained a new S->A model and fixed the autoregressive sampling, and it has started generating some recognizable speech.

There is some serious bug (it generates only the first 10 seconds; everything afterwards is noise), but the common phrases ("This is a LibriVox recording", "Gentleman") already sound quite good (modulo the quality of the EnCodec speech codec at 1.5 kbps). Once I figure out this bug it should train a lot more easily, so I expect a big jump in quality in my next update. :)

@jpc
Contributor Author

jpc commented Apr 3, 2023

I fixed the 10-second generation bug (it was a bug in the sampling code). I also found out that lowering the multinomial sampling temperature to 0.8 improves the quality quite a lot (a sketch of temperature sampling is below).
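
For reference, a minimal sketch of temperature-scaled multinomial sampling (illustrative, not the project's exact sampling loop):

```python
import torch

def sample_token(logits: torch.Tensor, temperature: float = 0.8) -> torch.Tensor:
    """Temperature < 1 sharpens the distribution: fewer low-probability
    'mistake' tokens at the cost of some diversity."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```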

I also trained another model, replacing cross-attention with adding the rescaled encoder features to the input of the middle layer of the decoder (both streams are sampled at fixed rates, so we don't need to learn a mapping between them), and got pretty good quality (a sketch of this additive conditioning follows the clip):

saar-1300hr-2l-20e-T0.8.mov
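
A hypothetical sketch of that additive conditioning; the linear projection and the repeat_interleave upsampling are my reading of "rescaled", not the exact code:

```python
import torch
import torch.nn as nn

class AdditiveConditioning(nn.Module):
    """Add projected encoder features to the hidden states entering a middle
    decoder layer. Both streams run at fixed rates, so a constant upsampling
    factor aligns them and no cross-attention map has to be learned."""
    def __init__(self, enc_dim: int, dec_dim: int, upsample: int = 3):
        super().__init__()
        self.proj = nn.Linear(enc_dim, dec_dim)
        self.upsample = upsample

    def forward(self, dec_hidden, enc_features):
        # dec_hidden: (B, T*upsample, dec_dim); enc_features: (B, T, enc_dim)
        cond = self.proj(enc_features).repeat_interleave(self.upsample, dim=1)
        return dec_hidden + cond
```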

@jpc
Contributor Author

jpc commented Apr 3, 2023

Oh, I forgot to mention that the new PyTorch 2.0 optimized attention implementation is amazing. With a very simple replacement I got a 4x speedup on an A100 (a sketch of the swap is below).
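
The swap itself, as a sketch with illustrative shapes: replace the manual softmax(qk^T / sqrt(d)) @ v computation with the fused kernel, which dispatches to FlashAttention-style implementations when available:

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1500, 64)  # (batch, heads, time, head_dim)
k = torch.randn(1, 8, 1500, 64)
v = torch.randn(1, 8, 1500, 64)

# Before (manual):
#   w = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
#   out = w.softmax(dim=-1) @ v
# After (PyTorch >= 2.0, one fused call):
out = F.scaled_dot_product_attention(q, k, v)
```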

@EmbraceAir

Hi @jpc, thanks for this excellent work! I have a small question about the semantic-to-acoustic model. I noticed that you set unique to False in your data loader, which is different from the paper. Will the semantic tokens then contain the prosodic information of the speech?

By the way, does the audio result above come from "3. Semantic to acoustic token modeling.ipynb" or the "3B *.ipynb"? Could you provide some pre-trained models?

Thanks

@jpc
Contributor Author

jpc commented Jan 9, 2024

Yup, our semantic tokens also carry prosody information. This makes the S2A model's job easier and the overall solution faster, but it also means that prosody cannot be changed during voice cloning.

The newest samples (in the README) sound a lot better.

@jpc jpc closed this as completed Jan 9, 2024