This is the official PyTorch implementation of QPPWG [1, 2]. QPPWG is a non-autoregressive neural speech generation model based on Parallel WaveGAN (PWG) and the quasi-periodic (QP) structure.
In this repo, we provide an example to train and test QPPWG as a vocoder for WORLD acoustic features. More details can be found on our Demo page.
- 2022/10/26 The related work SiFiGAN with improved inference speed is released by Reo Yoneyama (@chomeyama).
- 2022/5/10 The related work Hn-uSFGAN with further improved periodic modeling is released by Reo Yoneyama (@chomeyama).
- 2021/4/07 The related work uSFGAN with improved periodic modeling is released by Reo Yoneyama @ Nagoya University (@chomeyama).
- 2020/7/22 Release v0.1.2
- 2020/6/27 Release mel-spec feature extraction and the pre-trained models of vcc20 corpus.
- 2020/6/26 Release the pre-trained models of vcc18 corpus.
- 2020/5/20 Release the first version (v0.1.1).
This repository is tested on Ubuntu 16.04 with a Titan V GPU.
- Python 3.6+
- CUDA 10.0
- cuDNN 7+
- PyTorch 1.0.1+
The code works with both anaconda and virtualenv. The following example uses anaconda.
$ conda create -n venvQPPWG python=3.6
$ source activate venvQPPWG
$ git clone https://github.com/bigpon/QPPWG.git
$ cd QPPWG
$ pip install -e .
Please refer to the PWG repo for more details.
- egs: The folder for projects.
- egs/vcc18: The folder of the VCC2018 project.
- egs/vcc18/exp: The folder for trained models.
- egs/vcc18/conf: The folder for configs.
- egs/vcc18/data: The folder for corpus related files (wav, feature, list ...).
- qppwg: The folder of the source codes.
- Modify the corresponding CUDA paths in `egs/vcc18/run.py`.
- Download the Voice Conversion Challenge 2018 (VCC2018) corpus to run the QPPWG example.
$ cd egs/vcc18
# Download training and validation corpus
$ wget -o train.log -O data/wav/train.zip https://datashare.is.ed.ac.uk/bitstream/handle/10283/3061/vcc2018_database_training.zip
# Download evaluation corpus
$ wget -o eval.log -O data/wav/eval.zip https://datashare.is.ed.ac.uk/bitstream/handle/10283/3061/vcc2018_database_evaluation.zip
# unzip corpus
$ unzip data/wav/train.zip -d data/wav/
$ unzip data/wav/eval.zip -d data/wav/
- Training wav list: `data/scp/vcc18_train_22kHz.scp`.
- Validation wav list: `data/scp/vcc18_valid_22kHz.scp`.
- Testing wav list: `data/scp/vcc18_eval_22kHz.scp`.
# Extract WORLD acoustic features and statistics of training and testing data
$ bash run.sh --stage 0 --conf PWG_30
- WORLD-related settings can be changed in `egs/vcc18/conf/vcc18.PWG_30.yaml`.
- If you want to use another corpus, please create a corresponding config and a file containing the power thresholds and f0 ranges, like `egs/vcc18/data/pow_f0_dict.yml`.
- More details about feature extraction can be found in the QPNet repo.
- The lists of auxiliary features will be automatically generated.
- Training aux list: `data/scp/vcc18_train_22kHz.list`.
- Validation aux list: `data/scp/vcc18_valid_22kHz.list`.
- Testing aux list: `data/scp/vcc18_eval_22kHz.list`.
# Training a QPPWG model with the 'QPPWGaf_20' config and the 'vcc18_train_22kHz' and 'vcc18_valid_22kHz' sets.
$ bash run.sh --gpu 0 --stage 1 --conf QPPWGaf_20 \
--trainset vcc18_train_22kHz --validset vcc18_valid_22kHz
- The gpu ID can be set by --gpu GPU_ID (default: 0)
- The model architecture can be set by --conf CONFIG (default: PWG_30)
- The checkpoint to resume training from can be set by --resume NUM (default: None)
# QPPWG/PWG decoding w/ natural acoustic features
$ bash run.sh --gpu 0 --stage 2 --conf QPPWGaf_20 \
--iter 400000 --trainset vcc18_train_22kHz --evalset vcc18_eval_22kHz
# QPPWG/PWG decoding w/ scaled f0 (ex: halved f0).
$ bash run.sh --gpu 0 --stage 3 --conf QPPWGaf_20 --scaled 0.50 \
--iter 400000 --trainset vcc18_train_22kHz --evalset vcc18_eval_22kHz
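To illustrate what decoding with a scaled f0 means, here is a minimal sketch of scaling an f0 contour by a factor such as 0.50. It assumes the WORLD convention that unvoiced frames carry an f0 of 0; the function name `scale_f0` is hypothetical and this is not the repository's own code, which may handle f0 differently (for example, with an interpolated continuous f0).

```python
def scale_f0(f0, factor):
    """Scale voiced F0 values (in Hz) by `factor`.

    Assumption: unvoiced frames are marked with F0 == 0 and must stay 0,
    so only strictly positive values are scaled.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return [value * factor if value > 0 else 0.0 for value in f0]


# A toy contour: unvoiced, 220 Hz, 440 Hz, unvoiced
f0_contour = [0.0, 220.0, 440.0, 0.0]
print(scale_f0(f0_contour, 0.50))  # [0.0, 110.0, 220.0, 0.0]
```

Scaling only the voiced frames keeps the voiced/unvoiced decision unchanged while shifting the pitch of the generated speech.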
$ tensorboard --logdir exp
- The training time of PWG_30 with a TITAN V is around 3 days.
- The training time of QPPWGaf_20 with a TITAN V is around 5 days.
- Vanilla PWG (PWG_30)
# On CPU (Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz 32 threads)
[decode]: 100%|███████████| 140/140 [04:50<00:00, 2.08s/it, RTF=0.771]
2020-05-26 12:30:27,273 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.579).
# On GPU (TITAN V)
[decode]: 100%|███████████| 140/140 [00:09<00:00, 14.89it/s, RTF=0.0155]
2020-05-26 12:32:26,160 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.016).
- PWG w/ only 20 blocks (PWG_20)
# On CPU (Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz 32 threads)
[decode]: 100%|███████████| 140/140 [03:57<00:00, 1.70s/it, RTF=0.761]
2020-05-30 13:50:20,438 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.474).
# On GPU (TITAN V)
[decode]: 100%|███████████| 140/140 [00:08<00:00, 16.55it/s, RTF=0.0105]
2020-05-30 13:43:50,793 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.011).
- QPPWG (QPPWGaf_20)
# On CPU (Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz 32 threads)
[decode]: 100%|███████████| 140/140 [04:12<00:00, 1.81s/it, RTF=0.455]
2020-05-26 12:38:15,982 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.512).
# On GPU (TITAN V)
[decode]: 100%|███████████| 140/140 [00:11<00:00, 12.57it/s, RTF=0.0218]
2020-05-26 12:33:32,469 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.020).
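For readers unfamiliar with the RTF values in the logs above, the following sketch shows the conventional definition of the real-time factor: processing time divided by the duration of the generated audio. The function name `real_time_factor` is hypothetical, and the exact way the decoder accumulates its RTF may differ from this sketch.

```python
def real_time_factor(elapsed_seconds, num_samples, sampling_rate):
    """Return processing time divided by the duration of the generated audio.

    A value below 1.0 means generation runs faster than real time.
    """
    audio_seconds = num_samples / sampling_rate
    return elapsed_seconds / audio_seconds


# 1 second of processing for 2 seconds of 22050 Hz audio (44100 samples)
print(real_time_factor(1.0, 44100, 22050))  # 0.5
```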
- The pre-trained models and generated utterances are released.
- You can download the whole folder of each corpus and put it in `egs/[corpus]` to run speech generation with the pre-trained models.
- Alternatively, you can download only the `[corpus]/data` folder and the desired pre-trained model, and then put the `data` folder in `egs/[corpus]` and the model folder in `egs/[corpus]/exp`.
- Models with both 100,000 iterations (trained w/ only STFT loss) and 400,000 iterations (trained w/ STFT and GAN losses) are released.
- The generated utterances are in the `wav` folder inside each model’s folder.
Corpus | Language | Fs [Hz] | Feature | Model | Conf |
---|---|---|---|---|---|
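vcc18 | EN | 22050 | world (uv + f0 + mcep + ap) (shiftms: 5) | PWG_20 | link |
vcc18 | EN | 22050 | world (uv + f0 + mcep + ap) (shiftms: 5) | PWG_30 | link |
vcc18 | EN | 22050 | world (uv + f0 + mcep + ap) (shiftms: 5) | QPPWGaf_20 | link |
vcc20 | EN, FI, DE, ZH | 24000 | melf0h128 (uv + f0 + mel-spc) (hop_size: 128) | PWG_20 | link |
vcc20 | EN, FI, DE, ZH | 24000 | melf0h128 (uv + f0 + mel-spc) (hop_size: 128) | PWG_30 | link |
vcc20 | EN, FI, DE, ZH | 24000 | melf0h128 (uv + f0 + mel-spc) (hop_size: 128) | QPPWGaf_20 | link |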
The minimal commands for performing analysis and synthesis with a pre-trained model are as follows.
# Make sure you have installed `qppwg`
# If not, install it via pip
$ pip install qppwg
# Take "vcc18" corpus as an example
# Download the whole folder of "vcc18"
$ ls vcc18
data exp
# Change directory to `vcc18` folder
$ cd vcc18
# Put audio files in `data/wav/` directory
$ ls data/wav/
sample1.wav sample2.wav
# Create a list `data/scp/sample.scp` of the audio files
$ tail data/scp/sample.scp
data/wav/sample1.wav
data/wav/sample2.wav
# Extract acoustic features
$ qppwg-preprocess \
--audio data/scp/sample.scp \
--indir wav \
--outdir hdf5 \
--config exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/config.yml
# The extracted features are in `data/hdf5/`
# The feature list `data/scp/sample.list` of the feature files will be automatically generated
$ ls data/hdf5/
sample1.h5 sample2.h5
$ ls data/scp/
sample.scp sample.list
# Synthesis
$ qppwg-decode \
--eval_feat data/scp/sample.list \
--stats data/stats/vcc18_train_22kHz.joblib \
--indir data/hdf5/ \
--outdir exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/wav/400000/ \
--checkpoint exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/checkpoint-400000steps.pkl
# Synthesis w/ halved F0
$ qppwg-decode \
--f0_factor 0.50 \
--eval_feat data/scp/sample.list \
--stats data/stats/vcc18_train_22kHz.joblib \
--indir data/hdf5/ \
--outdir exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/wav/400000/ \
--checkpoint exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/checkpoint-400000steps.pkl
# The generated utterances can be found in `exp/[model]/wav/400000/`
$ ls exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/wav/400000/
sample1.wav sample1_f0.50.wav sample2.wav sample2_f0.50.wav
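The audio list used in the steps above is a plain text file with one wav path per line. As a convenience, the sketch below writes such a list from a directory of wav files. The helper name `write_scp` is hypothetical and not part of `qppwg`; it assumes the paths should be relative to the corpus folder, as in the `data/scp/sample.scp` listing above.

```python
from pathlib import Path


def write_scp(wav_dir, scp_path):
    """Write the sorted *.wav paths under `wav_dir` to `scp_path`, one per line.

    Paths are written as given (relative if `wav_dir` is relative), using
    forward slashes. Returns the list of written paths.
    """
    wav_files = sorted(Path(wav_dir).glob("*.wav"))
    lines = [path.as_posix() for path in wav_files]
    scp_path = Path(scp_path)
    # Create the list directory (e.g. data/scp) if it does not exist yet
    scp_path.parent.mkdir(parents=True, exist_ok=True)
    scp_path.write_text("".join(line + "\n" for line in lines))
    return lines
```

Run from inside the `vcc18` folder, `write_scp("data/wav", "data/scp/sample.scp")` would produce a list in the same format as the one shown by `tail data/scp/sample.scp`.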
The QPPWG repository is developed based on the following repositories and paper.
If you find the code helpful, please cite the following articles.
@inproceedings{qppwg_2020,
author={Yi-Chiao Wu and Tomoki Hayashi and Takuma Okamoto and Hisashi Kawai and Tomoki Toda},
title={{Quasi-Periodic Parallel WaveGAN Vocoder: A Non-Autoregressive Pitch-Dependent Dilated Convolution Model for Parametric Speech Generation}},
year=2020,
booktitle={Proc. Interspeech 2020},
pages={3535--3539},
doi={10.21437/Interspeech.2020-1070},
url={http://dx.doi.org/10.21437/Interspeech.2020-1070}
}
@ARTICLE{9324976,
author={Y. -C. {Wu} and T. {Hayashi} and T. {Okamoto} and H. {Kawai} and T. {Toda}},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={Quasi-Periodic Parallel WaveGAN: A Non-Autoregressive Raw Waveform Generative Model With Pitch-Dependent Dilated Convolution Neural Network},
year={2021},
volume={29},
pages={792-806},
doi={10.1109/TASLP.2021.3051765}}
Development:
Yi-Chiao Wu @ Nagoya University (@bigpon)
E-mail: [email protected]
Advisor:
Tomoki Toda @ Nagoya University
E-mail: [email protected]