This HiFi-GAN implementation is compatible with the output of the chasing waterfalls acoustic model. Mel spectrograms generated by the acoustic model can be read from pickle files for inference and fine-tuning. The vocoder was pretrained on LJ-Speech at 44kHz (upsampled using `upsample.sh`). The `segment_length` was doubled so that each training segment covers the same time span as in the original HiFi-GAN paper.
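The effect of doubling the segment length can be checked with a little arithmetic. The sketch below assumes the upstream HiFi-GAN V1 values (a segment of 8192 samples at 22050 Hz, where the config key is called `segment_size`) and assumes that "44kHz" means 44100 Hz; neither number is stated in this README, so check them against `config_am.json`.

```python
# Assumed values: 8192 samples at 22050 Hz follow the upstream V1 config,
# and 44100 Hz is an assumption for what this README calls "44kHz".
ORIGINAL_SEGMENT_SIZE = 8192
ORIGINAL_SAMPLING_RATE = 22050
NEW_SAMPLING_RATE = 44100


def segment_duration(segment_size: int, sampling_rate: int) -> float:
    """Return the length of one training segment in seconds."""
    return segment_size / sampling_rate


# Doubling the number of samples compensates for the doubled sampling rate.
doubled_segment_size = 2 * ORIGINAL_SEGMENT_SIZE

original_seconds = segment_duration(ORIGINAL_SEGMENT_SIZE, ORIGINAL_SAMPLING_RATE)
new_seconds = segment_duration(doubled_segment_size, NEW_SAMPLING_RATE)

print(f"original: {original_seconds:.4f} s, doubled at 44.1 kHz: {new_seconds:.4f} s")
```

Under these assumptions both segments last the same amount of time, which is the property the doubling is meant to preserve.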
The acoustic model was trained on mels from 44kHz, 32bit audio files with a specific hop and window length. The mels generated by the acoustic model have the same format as mels generated using `librosa.feature.melspectrogram()` and were normalized (using `meldataset.norm_mel()`) prior to saving them as pickle files. HiFi-GAN's mel generation is a `torch.Tensor` implementation of the same method as librosa, but with an additional dynamic range compression as the last step (see `meldataset.spectral_normalize_torch`).
For inference and fine-tuning, these mels were therefore denormalized and `meldataset.dynamic_range_compression()` was applied to convert them to the same format as in the original HiFi-GAN implementation.
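As a rough illustration of the last conversion step, the sketch below reads a mel from a pickle file and applies a log-based dynamic range compression. It is dependency-free and works on nested lists, whereas the repository operates on arrays and tensors. The defaults `C=1.0` and `clip_val=1e-5` are assumed to match upstream HiFi-GAN and should be checked against `meldataset.py`. The denormalization step (the inverse of `meldataset.norm_mel()`) is not reproduced here because its formula is not described in this README, and the toy data and file name are purely illustrative.

```python
import math
import pickle
import tempfile
from pathlib import Path


def dynamic_range_compression(mel, C=1.0, clip_val=1e-5):
    """Log-compress a mel given as a list of rows of floats.

    Values below clip_val are clipped first so the logarithm stays finite.
    C and clip_val are assumed defaults, not confirmed for this fork.
    """
    return [[math.log(max(value, clip_val) * C) for value in row] for row in mel]


# Toy 2x3 "mel" standing in for already denormalized acoustic model output.
toy_mel = [[0.0, 1.0, 2.5], [1e-7, 0.5, 10.0]]

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / "toy_mel.pkl"  # hypothetical file name
    with open(pkl_path, "wb") as handle:
        pickle.dump(toy_mel, handle)
    # Reading the pickle back mirrors how a saved mel would be loaded.
    with open(pkl_path, "rb") as handle:
        loaded_mel = pickle.load(handle)

compressed_mel = dynamic_range_compression(loaded_mel)
```

Clipping before the logarithm matters because silent frames can contain zeros, which would otherwise produce negative infinity.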
The training and validation splits are defined during the dataset generation of the acoustic model. For those files, a `training.txt` and a `validation.txt` have to be generated, containing all audio filenames of the corresponding dataset. `prepare_dataset.py` can be used for this as follows:
python3 prepare_dataset.py --dataset_folder ../fastspeech_fork/data/K3_processed/snippets_test/wav_mono/
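For orientation, a file list of this kind could be produced roughly as sketched below. This is not the code of `prepare_dataset.py`: the helper `write_file_list` is hypothetical, and writing one file stem per line (without the `.wav` extension) is an assumption about the expected format. The sketch also writes a single list and does not reproduce how the acoustic model's split is looked up.

```python
import tempfile
from pathlib import Path
from typing import List


def write_file_list(wav_dir: Path, list_path: Path) -> List[str]:
    """Write the stem of every .wav file in wav_dir to list_path, one per line."""
    stems = sorted(path.stem for path in wav_dir.glob("*.wav"))
    list_path.write_text("".join(f"{stem}\n" for stem in stems), encoding="utf-8")
    return stems


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp_dir:
    wav_dir = Path(tmp_dir) / "wav_mono"
    wav_dir.mkdir()
    for name in ("b_0002.wav", "a_0001.wav", "notes.txt"):
        (wav_dir / name).touch()

    list_path = Path(tmp_dir) / "training.txt"
    listed = write_file_list(wav_dir, list_path)
    written = list_path.read_text(encoding="utf-8")

print(listed)
```

Sorting the stems keeps the generated list stable between runs, which makes it easier to compare lists across dataset versions.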
The model was pretrained using LJSpeech for 220k steps and then trained again on the destination dataset for another 45k steps using the commands below:
# pretraining on 44khz upsampled LJSpeech
python3 train.py --config config_am.json --input_wavs_dir LJSpeech-1.1/wavs_44khz --training_epochs 1000000 --input_training_file LJSpeech-1.1/training.txt --input_validation_file LJSpeech-1.1/validation.txt --checkpoint_interval 20000 --validation_interval 50 --stdout_interval 10
# train on real dataset at 44khz starting from ljspeech pretrained
python3 train.py --config config_am.json --input_wavs_dir /opt/waterfalls/data/vocoder/097/wav_mono --training_epochs 50000 --input_training_file /opt/waterfalls/data/vocoder/097/training.txt --input_validation_file /opt/waterfalls/data/vocoder/097/validation.txt --input_mels_dir /opt/waterfalls/data/vocoder/097/mels_diff --stdout_interval 10 --fine_tuning False --validation_interval 50
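One detail worth verifying in the second command is how `--fine_tuning False` is parsed. If `train.py` declares the flag with `argparse` and `type=bool` (an assumption about this fork that should be checked in the source), any non-empty string, including `False`, is converted to `True`. The sketch below demonstrates this behavior of `argparse` and shows a hypothetical `str_to_bool` converter that parses the text instead.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Hypothetical converter that interprets common boolean spellings."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


# With type=bool, argparse calls bool("False"), which is True
# because the string is non-empty.
naive_parser = argparse.ArgumentParser()
naive_parser.add_argument("--fine_tuning", default=False, type=bool)
naive_value = naive_parser.parse_args(["--fine_tuning", "False"]).fine_tuning

# With an explicit converter, the text "False" becomes the boolean False.
strict_parser = argparse.ArgumentParser()
strict_parser.add_argument("--fine_tuning", default=False, type=str_to_bool)
strict_value = strict_parser.parse_args(["--fine_tuning", "False"]).fine_tuning

print(naive_value, strict_value)
```

If the flag turns out to be parsed this way, omitting `--fine_tuning` entirely is the simplest way to get the default value.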
The flag `--test_pickle` can be used to evaluate the model during training against a chosen mel spectrogram. See the sample pickle file `am_output.pkl`.
The vocoder can be used to generate 44kHz, 32bit audio files from mel spectrograms saved in pickle files as shown below:
python3 inference_e2e.py --checkpoint_file g_00140000 --input_mels_dir data/test/ --output_dir data/test/ --file_suffix _ljs_44khz_140k
The additional `inference_app.py` script is used to integrate the vocoder into a Svelte frontend.
See below for the original HiFi-GAN Readme.
In our paper, we proposed HiFi-GAN: a GAN-based model capable of generating high fidelity speech efficiently.
We provide our implementation and pretrained models as open source in this repository.
Abstract : Several recent work on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling periodic patterns of an audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.
Visit our demo website for audio samples.
- Python >= 3.6
- Clone this repository.
- Install Python requirements. Please refer to `requirements.txt`.
- Download and extract the LJ Speech dataset, then move all wav files to `LJSpeech-1.1/wavs`.
python train.py --config config_v1.json
To train the V2 or V3 Generator, replace `config_v1.json` with `config_v2.json` or `config_v3.json`.
Checkpoints and a copy of the configuration file are saved in the `cp_hifigan` directory by default.
You can change the path by adding the `--checkpoint_path` option.
Validation loss during training with V1 generator.
You can also use pretrained models we provide.
Download pretrained models
Details of each folder are as follows:
| Folder Name | Generator | Dataset | Fine-Tuned |
|---|---|---|---|
| LJ_V1 | V1 | LJSpeech | No |
| LJ_V2 | V2 | LJSpeech | No |
| LJ_V3 | V3 | LJSpeech | No |
| LJ_FT_T2_V1 | V1 | LJSpeech | Yes (Tacotron2) |
| LJ_FT_T2_V2 | V2 | LJSpeech | Yes (Tacotron2) |
| LJ_FT_T2_V3 | V3 | LJSpeech | Yes (Tacotron2) |
| VCTK_V1 | V1 | VCTK | No |
| VCTK_V2 | V2 | VCTK | No |
| VCTK_V3 | V3 | VCTK | No |
| UNIVERSAL_V1 | V1 | Universal | No |
We provide the universal model with discriminator weights that can be used as a base for transfer learning to other datasets.
- Generate mel-spectrograms in numpy format using Tacotron2 with teacher-forcing.
  The file name of the generated mel-spectrogram should match the audio file and the extension should be `.npy`.
  Example: audio file `LJ001-0001.wav`, mel-spectrogram file `LJ001-0001.npy`.
- Create an `ft_dataset` folder and copy the generated mel-spectrogram files into it.
- Run the following command.
  For other command line options, please refer to the training section.
python train.py --fine_tuning True --config config_v1.json
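Because fine-tuning relies on each audio file having a mel-spectrogram file with the same name, it can help to check the pairing before starting a run. The sketch below is one way to do that; the helper `find_missing_mels` is hypothetical and not part of this repository, and the directory names in the demonstration are only examples.

```python
import tempfile
from pathlib import Path
from typing import List


def find_missing_mels(wav_dir: Path, mel_dir: Path) -> List[str]:
    """Return stems of .wav files that have no same-named .npy file in mel_dir."""
    mel_stems = {path.stem for path in mel_dir.glob("*.npy")}
    return sorted(
        path.stem for path in wav_dir.glob("*.wav") if path.stem not in mel_stems
    )


# Demonstration on a throwaway directory: one wav has a mel, one does not.
with tempfile.TemporaryDirectory() as tmp_dir:
    wav_dir = Path(tmp_dir) / "wavs"
    mel_dir = Path(tmp_dir) / "ft_dataset"
    wav_dir.mkdir()
    mel_dir.mkdir()
    (wav_dir / "LJ001-0001.wav").touch()
    (wav_dir / "LJ001-0002.wav").touch()
    (mel_dir / "LJ001-0001.npy").touch()

    missing = find_missing_mels(wav_dir, mel_dir)

print(missing)
```

An empty result means every wav file has a matching `.npy` file; any listed stem points to a mel that still needs to be generated or copied.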
- Make a `test_files` directory and copy wav files into it.
- Run the following command.
python inference.py --checkpoint_file [generator checkpoint file path]
Generated wav files are saved in `generated_files` by default.
You can change the path by adding the `--output_dir` option.
- Make a `test_mel_files` directory and copy generated mel-spectrogram files into it.
  You can generate mel-spectrograms using Tacotron2, Glow-TTS and so forth.
- Run the following command.
python inference_e2e.py --checkpoint_file [generator checkpoint file path]
Generated wav files are saved in `generated_files_from_mel` by default.
You can change the path by adding the `--output_dir` option.
We referred to WaveGlow, MelGAN and Tacotron2 to implement this.