Unofficial PyTorch implementation of FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. This repo uses the FastSpeech implementation from ESPnet as a base. It tries to replicate the exact paper details, but some modifications were still required for a better model; suggestions and improvements are welcome. Audio pre-processing follows NVIDIA's Tacotron 2 preprocessing, and MelGAN is used as the vocoder.
All code is written in Python 3.6.2.
- Install PyTorch
Before installing PyTorch, check your CUDA version by running the following command:
nvcc --version
pip install torch torchvision
This repo uses PyTorch 1.6.0 for the torch.bucketize feature, which is not available in earlier versions of PyTorch.
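FastSpeech 2's variance adaptor quantizes continuous pitch/energy values into bins before looking up embeddings, and torch.bucketize is used for that quantization. A minimal sketch of the idea (the F0 range and bin count below are placeholders, not this repo's values):

```python
import torch

# Placeholder F0 range (Hz) and bin count; the real values come from
# compute_statistics.py and hparams.py.
p_min, p_max, n_bins = 71.0, 795.8, 256
boundaries = torch.linspace(p_min, p_max, n_bins - 1)

pitch = torch.tensor([150.0, 220.5, 480.0])   # example F0 values for three frames
bins = torch.bucketize(pitch, boundaries)     # integer bin index per value
pitch_embedding = torch.nn.Embedding(n_bins, 256)
pitch_emb = pitch_embedding(bins)             # shape: (3, 256)
```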
- Install the other requirements:
pip install -r requirements.txt
- To use TensorBoard, install tensorboard version 1.14.0 separately, together with the supported tensorflow version (1.14.0).
The filelists folder contains MFA (Montreal Forced Aligner) processed LJSpeech dataset files, so you do not need to align text with audio (to extract durations) for the LJSpeech dataset.
For other datasets, follow the instructions here. For the remaining pre-processing, run the following command:
python .\nvidia_preprocessing.py -d path_of_wavs
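For reference, a rough sketch of Tacotron 2-style mel-spectrogram extraction is shown below. The parameters are the common LJSpeech/Tacotron 2 defaults and are assumptions here; the values actually used by nvidia_preprocessing.py live in the repo's hparams/config.

```python
import librosa
import numpy as np

# Assumed Tacotron 2 / LJSpeech defaults: 22050 Hz, 1024-point FFT, 256 hop, 80 mel bins.
wav, sr = librosa.load("path_of_wavs/LJ001-0001.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, win_length=1024,
    n_mels=80, fmin=0.0, fmax=8000.0, power=1.0)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # dynamic-range compression
print(log_mel.shape)  # (80, n_frames)
```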
To find the min and max of F0 and energy, run:
python .\compute_statistics.py
Update the following values in hparams.py with the computed min and max of F0 and energy:
p_min = minimum F0/pitch
p_max = maximum F0
e_min = minimum energy
e_max = maximum energy
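A rough sketch of the kind of statistics pass compute_statistics.py performs is shown below; the use of pyworld for F0 and the L2 norm of STFT frames for energy is an assumption, not the script's exact implementation.

```python
import glob
import numpy as np
import librosa
import pyworld as pw

p_min, p_max = np.inf, -np.inf
e_min, e_max = np.inf, -np.inf

for path in glob.glob("path_of_wavs/*.wav"):
    wav, sr = librosa.load(path, sr=22050)

    # F0 with WORLD's DIO; unvoiced frames (f0 == 0) are ignored.
    f0, _ = pw.dio(wav.astype(np.float64), sr, frame_period=256 / sr * 1000)
    voiced = f0[f0 > 0]
    if voiced.size:
        p_min, p_max = min(p_min, voiced.min()), max(p_max, voiced.max())

    # Energy as the L2 norm of each STFT frame.
    mag = np.abs(librosa.stft(wav, n_fft=1024, hop_length=256))
    energy = np.linalg.norm(mag, axis=0)
    e_min, e_max = min(e_min, energy.min()), max(e_max, energy.max())

print(f"p_min={p_min:.1f}, p_max={p_max:.1f}, e_min={e_min:.1f}, e_max={e_max:.1f}")
```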
Train the model with:
python train_fastspeech.py --outdir etc -c configs/default.yaml -n "name"
Currently, only phoneme-based synthesis is supported. Run inference with:
python .\inference.py -c .\configs\default.yaml -p .\checkpoints\first_1\ts_version2_fastspeech_fe9a2c7_7k_steps.pyt --out output --text "ModuleList can be indexed like a regular Python list but modules it contains are properly registered."
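Because synthesis is phoneme based, the input text is converted to a phoneme sequence first. A minimal sketch using g2p_en (an assumption; the repo's own text frontend may use a different grapheme-to-phoneme tool):

```python
from g2p_en import G2p

g2p = G2p()
phonemes = g2p("ModuleList can be indexed like a regular Python list.")
print(phonemes)  # list of ARPAbet phonemes, e.g. ['M', 'AA1', 'JH', ...]
```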
Export a TorchScript model with:
python export_torchscript.py -c configs/default.yaml -n fastspeech_scrip --outdir etc
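The exported module can then be loaded without the Python model definition. A hedged sketch (the output file name and the model's expected inputs/outputs are assumptions here, not documented by this repo):

```python
import torch

# Assumed path based on the --outdir/-n arguments above.
model = torch.jit.load("etc/fastspeech_scrip.pt", map_location="cpu")
model.eval()

phoneme_ids = torch.randint(0, 70, (1, 50))  # dummy phoneme-ID sequence (batch, time)
with torch.no_grad():
    out = model(phoneme_ids)  # the real input/output signature depends on the export
```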
- Checkpoints can be found here.
- For samples, check the sample folder.
- The code in this repo was written roughly, to reproduce the paper and for experimentation; it needs cleanup and optimization for wider use.
- This repo currently produces good-quality audio, but it is still a work in progress and many improvements are needed.
- The loss curve for F0 is quite high.
- Raw F0 and energy are currently used for training, but normalized F0 and energy could also be used for more stable training.
- A Postnet is used for better audio quality (a rough sketch is shown below).
- For a more complete, end-to-end voice cloning or text-to-speech (TTS) toolbox ⚡ please visit Deepsync Technologies.
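The Postnet mentioned above typically follows the Tacotron 2 design: a small stack of 1-D convolutions that predicts a residual added to the decoder's coarse mel output. A sketch of that design (layer sizes are the usual Tacotron 2 defaults, not necessarily this repo's):

```python
import torch
import torch.nn as nn

class Postnet(nn.Module):
    """Tacotron 2-style Postnet: 5 conv layers predicting a residual over the mel output."""

    def __init__(self, n_mels=80, channels=512, kernel_size=5, n_layers=5):
        super().__init__()
        layers = []
        for i in range(n_layers):
            in_ch = n_mels if i == 0 else channels
            out_ch = n_mels if i == n_layers - 1 else channels
            layers += [
                nn.Conv1d(in_ch, out_ch, kernel_size, padding=(kernel_size - 1) // 2),
                nn.BatchNorm1d(out_ch),
            ]
            if i != n_layers - 1:
                layers.append(nn.Tanh())
        self.net = nn.Sequential(*layers)

    def forward(self, mel):          # mel: (batch, n_mels, frames)
        return mel + self.net(mel)   # residual refinement of the coarse mel
```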