Semantic -> acoustic modeling #4
Comments
I pushed the first version of the semantic to acoustic modeling based on the Whisper transformer model, but it does not train, so I probably still have some bugs somewhere. I'm going to create a synthetic dataset and debug it like I did with the quantization bottleneck.
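For reference, here is a minimal sketch of the kind of synthetic dataset that makes this sort of debugging easy; the vocabulary sizes and the input-to-target rule below are made up for illustration, not taken from the repo:

```python
import torch
from torch.utils.data import Dataset

class SyntheticS2ADataset(Dataset):
    """Toy semantic -> acoustic dataset for debugging.

    The acoustic targets are a deterministic, easily learnable function of
    the semantic inputs, so a correctly wired seq2seq model should reach
    near-zero loss very quickly. If it does not, the bug is in the model
    or the training loop, not in the data.
    """

    def __init__(self, n_samples=10_000, seq_len=128,
                 n_semantic=512, n_acoustic=1024):
        self.inputs = torch.randint(0, n_semantic, (n_samples, seq_len))
        # Placeholder mapping rule from semantic to acoustic tokens.
        self.targets = (self.inputs * 7 + 3) % n_acoustic

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]
```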
I found some bugs in the code and now it trains successfully.
The performance is still not great but it's a step in the right direction. :) It's still based on the old VQ/RQ tokens, so the improved tokens from #3 should help a bit. I also experimented with using Whisper embeddings directly (without quantization) and it works. It allowed me to easily experiment with extracting the embeddings from other layers of the encoder. This seems promising for balancing the difficulty of the two translation tasks: text to semantic tokens vs. semantic tokens to acoustic tokens. For reference, in SPEAR-TTS the semantic-to-acoustic task was a lot easier (they used a decoder-only model with 12 layers, about the size of Whisper Base) than the text-to-semantic task (T5-Large: a 24-layer encoder plus a 24-layer decoder, the exact same size as Whisper Medium). So right now we will focus on trying to understand the balance between these two tasks.
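For anyone curious what extracting embeddings from a different encoder layer might look like, here is a sketch using the openai-whisper package and a forward hook; the model size ("base"), the layer index, and the file name are placeholders, not the repo's actual setup:

```python
import torch
import whisper  # openai-whisper

model = whisper.load_model("base", device="cpu")  # model size is a placeholder
audio = whisper.pad_or_trim(whisper.load_audio("sample.wav"))
mel = whisper.log_mel_spectrogram(audio).unsqueeze(0)  # (1, 80, 3000)

captured = {}

def save_output(module, inputs, output):
    # Store the activations of the tapped encoder block.
    captured["emb"] = output.detach()

layer_idx = 3  # which encoder layer to tap is a hyperparameter to explore
handle = model.encoder.blocks[layer_idx].register_forward_hook(save_output)
with torch.no_grad():
    model.encoder(mel)
handle.remove()

print(captured["emb"].shape)  # (1, n_frames, n_audio_state), e.g. (1, 1500, 512)
```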
I've trained a new S->A model and fixed the autoregressive sampling, and it started generating some recognizable speech. There is still a serious bug (it generates only the first 10 seconds, everything afterwards is noise), but the common phrases ("This is a LibriVox recording", "Gentleman") already sound quite good (modulo the quality of the EnCodec speech codec at 1.5kbps). Once I figure out this bug, it should train a lot more easily, so I expect a big jump in quality in my next update. :)
I fixed the 10 second generation bug (it was a bug in the sampling code). I also found out that lowering the multinomial sampling temperature to 0.8 improves the quality quite a lot. I also trained another model, replacing cross-attention with adding the rescaled encoder features to the input of the middle layer of the decoder (both streams are sampled at a fixed rate, so we don't need to learn to map one to the other), and got pretty good quality: saar-1300hr-2l-20e-T0.8.mov
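Here are two small sketches of what these changes could look like; all shapes, frame rates, and names below are illustrative placeholders, not the actual repo code:

```python
import torch
import torch.nn.functional as F

# Temperature sampling: divide the logits by T before the softmax;
# T < 1 sharpens the distribution (here T = 0.8).
def sample_next_token(logits, temperature=0.8):
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Feature addition instead of cross-attention: since both token streams
# run at fixed frame rates, the encoder features can be resampled to the
# decoder's length, projected to its width, and added to the hidden
# states feeding the middle decoder layer.
def add_encoder_features(decoder_hidden, encoder_features, proj):
    x = encoder_features.transpose(1, 2)                # (B, D_enc, T_enc)
    x = F.interpolate(x, size=decoder_hidden.shape[1], mode="nearest")
    x = x.transpose(1, 2)                               # (B, T_dec, D_enc)
    return decoder_hidden + proj(x)

# Usage with placeholder shapes:
B, T_enc, T_dec, D_enc, D_dec = 2, 150, 225, 512, 768
proj = torch.nn.Linear(D_enc, D_dec)
hidden = torch.randn(B, T_dec, D_dec)
enc = torch.randn(B, T_enc, D_enc)
out = add_encoder_features(hidden, enc, proj)           # (B, T_dec, D_dec)
```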
Oh, I forgot to mention that the new PyTorch 2.0 optimized attention implementation is amazing. With a very simple replacement I got a 4x speedup on an A100.
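The "very simple replacement" is presumably swapping a hand-written attention for torch.nn.functional.scaled_dot_product_attention, which dispatches to fused (FlashAttention / memory-efficient) kernels on supported GPUs. A minimal before/after sketch (the manual version in the comments is the generic pre-2.0 pattern, not necessarily this repo's exact code):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=True):
    # Pre-2.0 manual attention (generic pattern):
    #   w = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    #   if causal: w = w.masked_fill(causal_mask, float("-inf"))
    #   return w.softmax(dim=-1) @ v
    # PyTorch 2.0 fused kernel, one call:
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)

q = k = v = torch.randn(2, 8, 1500, 64)  # (batch, heads, time, head_dim)
out = attention(q, k, v)                 # same shape as q
```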
Hi @jpc, thanks for this excellent work! I have a small question about the semantic to acoustic model. I noticed that you set unique to False in your data loader, which is different from the paper. Will the semantic tokens then contain prosodic information? By the way, does the audio result above come from "3. Semantic to acoustic token modeling.ipynb" or from "3B *.ipynb"? Could you provide some pre-trained models? Thanks!
Yup, our semantic tokens also carry prosody information. This makes the S2A model's job easier and the overall solution faster. It also means that prosody cannot be changed with voice cloning. The newest samples (in the README) sound a lot better.
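For illustration, here is what the unique flag changes in practice (a sketch, not the actual data-loader code): deduplicating consecutive semantic tokens, as in SPEAR-TTS, discards token durations, while keeping the repeats preserves them and hence the prosody.

```python
import torch

semantic = torch.tensor([7, 7, 7, 42, 42, 13, 13, 13, 13, 7])

# unique=True (SPEAR-TTS style): collapse consecutive repeats.
# Durations are lost, so prosody has to be re-generated later.
print(torch.unique_consecutive(semantic))  # tensor([ 7, 42, 13,  7])

# unique=False (as used here): keep the repeats, so token durations,
# and with them prosody, carry through to the S2A model.
print(semantic)
```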
We got #3 working so now it's time to try to convert from Whisper-based semantic tokens (#3) to EnCodec-based acoustic tokens (#2).
We found out that better semantic tokens (from Whisper medium) make this task a lot easier, and even tiny models sound great. Multilingual semantic token training helps, and cross-language voice cloning works great. There are a couple of hypotheses to test:
We also still have a couple of engineering challenges:
- … (speed up the training a lot)