Quasi-Periodic Parallel WaveGAN (QPPWG)

This is the official PyTorch implementation of QPPWG [1, 2]. QPPWG is a non-autoregressive neural speech generation model built on Parallel WaveGAN (PWG) and the quasi-periodic (QP) structure.

In this repo, we provide an example to train and test QPPWG as a vocoder for WORLD acoustic features. More details can be found on our Demo page.

News

2022/10/26 The related work SiFiGAN with improved inference speed is released by Reo Yoneyama (@chomeyama).
2022/5/10 The related work Hn-uSFGAN with further improved periodic modeling is released by Reo Yoneyama (@chomeyama).
2021/4/07 The related work uSFGAN with improved periodic modeling is released by Reo Yoneyama @ Nagoya University (@chomeyama).
2020/7/22 Release v0.1.2
2020/6/27 Release mel-spec feature extraction and the pre-trained models of vcc20 corpus.
2020/6/26 Release the pre-trained models of vcc18 corpus.
2020/5/20 Release the first version (v0.1.1).

Requirements

This repository is tested on Ubuntu 16.04 with a Titan V GPU.

Python 3.6+
CUDA 10.0
cuDNN 7+
PyTorch 1.0.1+

Environment setup

The code works with both anaconda and virtualenv. The following example uses anaconda.

$ conda create -n venvQPPWG python=3.6
$ source activate venvQPPWG
$ git clone https://github.com/bigpon/QPPWG.git
$ cd QPPWG
$ pip install -e .

Please refer to the PWG repo for more details.
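
As a quick sanity check after setup, the version requirements listed above can be compared against the active environment. This is a minimal sketch, not part of the repository: `parse_version` and `meets_minimum` are hypothetical helper names, and the PyTorch and CUDA checks only run if `torch` can be imported.

```python
import sys


def parse_version(text):
    """Turn a string such as '1.13.1+cu117' into (1, 13, 1).

    Local build tags after '+' and non-numeric suffixes are ignored.
    """
    core = text.split("+", 1)[0]
    parts = []
    for piece in core.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    python_version = "{}.{}.{}".format(*sys.version_info[:3])
    print("Python >= 3.6:", meets_minimum(python_version, "3.6"))
    try:
        import torch  # only needed for the PyTorch and CUDA checks
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print("PyTorch >= 1.0.1:", meets_minimum(torch.__version__, "1.0.1"))
        print("CUDA available:", torch.cuda.is_available())
```

The check only reports versions; it does not verify that the installed CUDA and cuDNN builds match the ones this repository was tested with.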

Folder architecture

egs: The folder for projects.
egs/vcc18: The folder of the VCC2018 project.
egs/vcc18/exp: The folder for trained models.
egs/vcc18/conf: The folder for configs.
egs/vcc18/data: The folder for corpus related files (wav, feature, list ...).
qppwg: The folder of the source codes.

Run

Corpus and path setup

Modify the corresponding CUDA paths in egs/vcc18/run.py.
Download the Voice Conversion Challenge 2018 (VCC2018) corpus to run the QPPWG example:

$ cd egs/vcc18
# Download training and validation corpus
$ wget -o train.log -O data/wav/train.zip https://datashare.is.ed.ac.uk/bitstream/handle/10283/3061/vcc2018_database_training.zip
# Download evaluation corpus
$ wget -o eval.log -O data/wav/eval.zip https://datashare.is.ed.ac.uk/bitstream/handle/10283/3061/vcc2018_database_evaluation.zip
# unzip corpus
$ unzip data/wav/train.zip -d data/wav/
$ unzip data/wav/eval.zip -d data/wav/

Training wav lists: data/scp/vcc18_train_22kHz.scp.
Validation wav lists: data/scp/vcc18_valid_22kHz.scp.
Testing wav list: data/scp/vcc18_eval_22kHz.scp.

Preprocessing

# Extract WORLD acoustic features and statistics of training and testing data
$ bash run.sh --stage 0 --conf PWG_30

WORLD-related settings can be changed in egs/vcc18/conf/vcc18.PWG_30.yaml.
If you want to use another corpus, please create a corresponding config and a file including power thresholds and f0 ranges like egs/vcc18/data/pow_f0_dict.yml.
More details about feature extraction can be found in the QPNet repo.
The lists of auxiliary features will be automatically generated.
Training aux lists: data/scp/vcc18_train_22kHz.list.
Validation aux lists: data/scp/vcc18_valid_22kHz.list.
Testing aux list: data/scp/vcc18_eval_22kHz.list.
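
When preparing another corpus, the wav lists have to be written by hand. The sketch below assumes the one-path-per-line layout seen in the `sample.scp` example of the Analysis-synthesis section; `write_scp` and the corpus paths in the usage note are hypothetical, and whether the tools expect paths relative to the corpus folder is an assumption based on that example.

```python
from pathlib import Path


def write_scp(wav_dir, scp_path):
    """Write every .wav file under wav_dir to scp_path, one path per line.

    Paths are sorted so that repeated runs give the same list.
    Returns the number of files written.
    """
    wav_files = sorted(Path(wav_dir).rglob("*.wav"))
    scp_path = Path(scp_path)
    scp_path.parent.mkdir(parents=True, exist_ok=True)
    with scp_path.open("w") as f:
        for wav in wav_files:
            f.write(wav.as_posix() + "\n")
    return len(wav_files)
```

For example, `write_scp("data/wav/mycorpus", "data/scp/mycorpus_train.scp")` run from a corpus folder would list every wav file found under `data/wav/mycorpus`. Splitting the files into training, validation, and testing lists is left to the user.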

QPPWG/PWG training

# Training a QPPWG model with the 'QPPWGaf_20' config and the 'vcc18_train_22kHz' and 'vcc18_valid_22kHz' sets.
$ bash run.sh --gpu 0 --stage 1 --conf QPPWGaf_20 \
--trainset vcc18_train_22kHz --validset vcc18_valid_22kHz

The GPU ID can be set with --gpu GPU_ID (default: 0).
The model architecture can be set with --conf CONFIG (default: PWG_30).
Training can be resumed from a saved model with --resume NUM (default: None).

QPPWG/PWG testing

# QPPWG/PWG decoding w/ natural acoustic features
$ bash run.sh --gpu 0 --stage 2 --conf QPPWGaf_20 \
--iter 400000 --trainset vcc18_train_22kHz --evalset vcc18_eval_22kHz
# QPPWG/PWG decoding w/ scaled f0 (ex: halved f0).
$ bash run.sh --gpu 0 --stage 3 --conf QPPWGaf_20 --scaled 0.50 \
--iter 400000 --trainset vcc18_train_22kHz --evalset vcc18_eval_22kHz
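
To make the effect of `--scaled 0.50` concrete, the sketch below scales a frame-level F0 contour by a constant factor. It is an illustration only, not the repository's code: `scale_f0` is a hypothetical helper, the contour values are made up, and it assumes the common WORLD convention that unvoiced frames are stored as 0.

```python
def scale_f0(f0_contour, factor):
    """Multiply voiced F0 values (Hz) by factor; keep unvoiced frames at 0."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return [f0 * factor if f0 > 0 else 0.0 for f0 in f0_contour]


# Made-up contour in Hz; 0.0 marks an unvoiced frame.
contour = [0.0, 220.0, 230.0, 0.0, 180.0]
print(scale_f0(contour, 0.50))  # [0.0, 110.0, 115.0, 0.0, 90.0]
```

Only the voiced frames change, so the voiced/unvoiced pattern of the utterance is preserved while the pitch is halved.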

Monitor training progress

$ tensorboard --logdir exp

The training time of PWG_30 with a TITAN V is around 3 days.
The training time of QPPWGaf_20 with a TITAN V is around 5 days.

Inference speed (RTF)

Vanilla PWG (PWG_30)

# On CPU (Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz 32 threads)
[decode]: 100%|███████████| 140/140 [04:50<00:00,  2.08s/it, RTF=0.771]
2020-05-26 12:30:27,273 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.579).
# On GPU (TITAN V)
[decode]: 100%|███████████| 140/140 [00:09<00:00, 14.89it/s, RTF=0.0155]
2020-05-26 12:32:26,160 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.016).

PWG w/ only 20 blocks (PWG_20)

# On CPU (Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz 32 threads)
[decode]: 100%|███████████| 140/140 [03:57<00:00,  1.70s/it, RTF=0.761]
2020-05-30 13:50:20,438 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.474).
# On GPU (TITAN V)
[decode]: 100%|███████████| 140/140 [00:08<00:00, 16.55it/s, RTF=0.0105]
2020-05-30 13:43:50,793 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.011).

QPPWG (QPPWGaf_20)

# On CPU (Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz 32 threads)
[decode]: 100%|███████████| 140/140 [04:12<00:00,  1.81s/it, RTF=0.455]
2020-05-26 12:38:15,982 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.512).
# On GPU (TITAN V)
[decode]: 100%|███████████| 140/140 [00:11<00:00, 12.57it/s, RTF=0.0218]
2020-05-26 12:33:32,469 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.020).
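
The RTF values in these logs can be read with the usual definition of the real-time factor: generation time divided by the duration of the generated audio, so values below 1 are faster than real time. The following sketch uses that standard definition; how the repository's decode script measures time (for example, whether file I/O is included) is not shown here, and `real_time_factor` and `timed_rtf` are hypothetical names.

```python
import time


def real_time_factor(elapsed_seconds, num_samples, sampling_rate):
    """Return elapsed time divided by the duration of the generated audio."""
    audio_seconds = num_samples / sampling_rate
    if audio_seconds <= 0:
        raise ValueError("no audio was generated")
    return elapsed_seconds / audio_seconds


def timed_rtf(generate, sampling_rate):
    """Time a callable that returns a sequence of samples and report its RTF."""
    start = time.perf_counter()
    samples = generate()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples), sampling_rate)


# 0.5 s of computation for 2 s of audio sampled at 22050 Hz.
print(real_time_factor(0.5, 2 * 22050, 22050))  # 0.25
```

With this definition, an RTF of 0.020 means one second of speech takes about 20 ms to generate.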

Models and results

The pre-trained models and generated utterances are released.
Download the whole folder of a corpus and put it in egs/[corpus] to run speech generation with the pre-trained models.
Alternatively, download only the [corpus]/data folder and the desired pre-trained model, then put the data folder in egs/[corpus] and the model folder in egs/[corpus]/exp.
Models at both 100,000 iterations (trained with only the STFT loss) and 400,000 iterations (trained with the STFT and GAN losses) are released.
The generated utterances are in the wav folder inside each model’s folder.

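Corpus	Language	Fs [Hz]	Feature	Model	Conf
vcc18	EN	22050	world (uv + f0 + mcep + ap) (shiftms: 5)	PWG_20	link
vcc18	EN	22050	world (uv + f0 + mcep + ap) (shiftms: 5)	PWG_30	link
vcc18	EN	22050	world (uv + f0 + mcep + ap) (shiftms: 5)	QPPWGaf_20	link
vcc20	EN, FI, DE, ZH	24000	melf0h128 (uv + f0 + mel-spc) (hop_size: 128)	PWG_20	link
vcc20	EN, FI, DE, ZH	24000	melf0h128 (uv + f0 + mel-spc) (hop_size: 128)	PWG_30	link
vcc20	EN, FI, DE, ZH	24000	melf0h128 (uv + f0 + mel-spc) (hop_size: 128)	QPPWGaf_20	link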

Usage of pre-trained models

Analysis-synthesis

The minimal commands for analysis and synthesis with a pre-trained model are shown below.

# Make sure you have installed `qppwg`
# If not, install it via pip
$ pip install qppwg
# Take "vcc18" corpus as an example
# Download the whole folder of "vcc18"
$ ls vcc18
  data    exp
# Change directory to `vcc18` folder
$ cd vcc18
# Put audio files in `data/wav/` directory
$ ls data/wav/
  sample1.wav    sample2.wav
# Create a list `data/scp/sample.scp` of the audio files
$ tail data/scp/sample.scp
  data/wav/sample1.wav
  data/wav/sample2.wav
# Extract acoustic features
$ qppwg-preprocess \
    --audio data/scp/sample.scp \
    --indir wav \
    --outdir hdf5 \
    --config exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/config.yml
# The extracted features are in `data/hdf5/`
# The feature list `data/scp/sample.list` of the feature files will be automatically generated
$ ls data/hdf5/
  sample1.h5    sample2.h5
$ ls data/scp/
  sample.scp    sample.list
# Synthesis
$ qppwg-decode \
    --eval_feat data/scp/sample.list \
    --stats data/stats/vcc18_train_22kHz.joblib \
    --indir data/hdf5/ \
    --outdir exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/wav/400000/ \
    --checkpoint exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/checkpoint-400000steps.pkl
# Synthesis w/ halved F0
$ qppwg-decode \
    --f0_factor 0.50 \
    --eval_feat data/scp/sample.list \
    --stats data/stats/vcc18_train_22kHz.joblib \
    --indir data/hdf5/ \
    --outdir exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/wav/400000/ \
    --checkpoint exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/checkpoint-400000steps.pkl
# The generated utterances can be found in `exp/[model]/wav/400000/`
$ ls exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/wav/400000/
  sample1.wav    sample1_f0.50.wav    sample2.wav    sample2_f0.50.wav
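
When several models or F0 factors need to be decoded, the `qppwg-decode` calls above can be assembled programmatically. The sketch below only builds the argument list from the flags and path patterns that appear in the commands above; `decode_command` is a hypothetical helper, and the output and checkpoint path patterns are inferred from this example rather than from a documented API.

```python
import shlex


def decode_command(model_dir, steps, feat_list, stats, feat_dir, f0_factor=None):
    """Build the argument list for one qppwg-decode call.

    Only flags shown in the commands above are used. The command is
    returned as a list and is not executed here.
    """
    cmd = ["qppwg-decode"]
    if f0_factor is not None:
        cmd += ["--f0_factor", "{:.2f}".format(f0_factor)]
    cmd += [
        "--eval_feat", feat_list,
        "--stats", stats,
        "--indir", feat_dir,
        "--outdir", "{}/wav/{}/".format(model_dir, steps),
        "--checkpoint", "{}/checkpoint-{}steps.pkl".format(model_dir, steps),
    ]
    return cmd


model = "exp/qppwg_vcc18_train_22kHz_QPPWGaf_20"
for factor in (None, 0.50):
    cmd = decode_command(
        model, 400000,
        "data/scp/sample.list",
        "data/stats/vcc18_train_22kHz.joblib",
        "data/hdf5/",
        f0_factor=factor,
    )
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Returning a list rather than a single string keeps the arguments safe to hand to `subprocess.run` without shell quoting issues.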

References

The QPPWG repository is developed based on the following repositories and paper.

Citation

If you find the code helpful, please cite the following articles.

@inproceedings{qppwg_2020,
author={Yi-Chiao Wu and Tomoki Hayashi and Takuma Okamoto and Hisashi Kawai and Tomoki Toda},
title={{Quasi-Periodic Parallel WaveGAN Vocoder: A Non-Autoregressive Pitch-Dependent Dilated Convolution Model for Parametric Speech Generation}},
year=2020,
booktitle={Proc. Interspeech 2020},
pages={3535--3539},
doi={10.21437/Interspeech.2020-1070},
url={http://dx.doi.org/10.21437/Interspeech.2020-1070}
}

@ARTICLE{9324976,
author={Y. -C. {Wu} and T. {Hayashi} and T. {Okamoto} and H. {Kawai} and T. {Toda}},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={Quasi-Periodic Parallel WaveGAN: A Non-Autoregressive Raw Waveform Generative Model With Pitch-Dependent Dilated Convolution Neural Network},
year={2021},
volume={29},
pages={792-806},
doi={10.1109/TASLP.2021.3051765}}

Authors

Development: Yi-Chiao Wu @ Nagoya University (@bigpon)

Advisor: Tomoki Toda @ Nagoya University
