Progress updates (from newest):
2023-12-10
Another trio of models; this time they support multiple languages (English and Polish). Here are two new samples for a sneak peek. You can check out our Colab to try it yourself!
English speech, female voice (transferred from a Polish language dataset):
whisperspeech-sample.mp4
A Polish sample, male voice:
whisperspeech-sample-pl.mp4
2023-07-14
We have trained a new pair of models, added support for multiple speakers, and integrated the Vocos vocoder to deliver a big overall quality boost. And this is not our last word: we are doing hyperparameter tuning to train bigger, higher-quality models.
An end-to-end generation example, inspired by one famous president's speech (don't forget to unmute the videos):
Female voice:
we-choose-tts.mp4
Male voice:
we-choose-tts-s467.mp4
We have streamlined the inference pipeline, and you can now test the model yourself on Google Colab.
2023-04-13
We have trained a preliminary T->S model and a new 3 kbps S->A model, which improves speech quality. Both models are still far from perfect, but we are clearly moving in the right direction (to the moon 🚀🌖!).
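For readers new to the acronyms: the pipeline splits TTS into two stages, with T->S mapping text to discrete semantic tokens and S->A mapping those to acoustic codes that a codec such as EnCodec turns into audio. The sketch below only illustrates that data flow; the function bodies are toy stand-ins, and none of the names correspond to the project's actual API:

```python
# Hypothetical sketch of the two-stage T->S / S->A pipeline described above.
# The function names and bodies are placeholders, not the real models.

def text_to_semantic(text: str) -> list[int]:
    """T->S stage: text in, a sequence of discrete semantic tokens out (stub)."""
    return [hash(word) % 512 for word in text.split()]  # toy stand-in

def semantic_to_acoustic(semantic: list[int]) -> list[int]:
    """S->A stage: semantic tokens in, codec-style acoustic codes out (stub)."""
    # Acoustic sequences run at a higher rate than semantic ones,
    # modeled here as a fixed 3x upsampling.
    return [tok % 1024 for tok in semantic for _ in range(3)]

text = "we choose to go to the moon"
acoustic = semantic_to_acoustic(text_to_semantic(text))
# The acoustic codes would then be decoded to a waveform by the codec.
```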
End-to-end TTS model with ≈ 6% WER (both T->S and S->A sampled with simple multinomial sampling at T = 0.7, no beam search); see #9 for more details:
(don't forget to unmute the video)
test-e2e-jfk-T0.7.mp4
Ground truth:
we-choose.mp4
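The multinomial sampling used above can be sketched in a few lines. This is a generic temperature-scaled categorical sampler, not the project's actual decoding code: logits are divided by the temperature, softmaxed, and one token is drawn from the resulting distribution, with no beam search.

```python
import math
import random

def sample_multinomial(logits, temperature=0.7, rng=random):
    """Draw one token index from temperature-scaled softmax over logits."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Lower temperatures sharpen the distribution toward the argmax token:
# T -> 0 approaches greedy decoding, T = 1 samples the raw softmax.
logits = [2.0, 1.0, 0.1]
token = sample_multinomial(logits, temperature=0.7)
```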
2023-04-03
We have trained a working S->A model. It does not sound amazing, but that is mostly due to EnCodec quality at 1.5 kbps.
Validation set ground truth (don't forget to unmute):
ground-truth.mov
The generated output from the S->A model (multinomial sampling, temperature 0.8):
saar-1300hr-2l-20e-T0.8.mov