VITS is a lightweight, low-latency model for English text-to-speech (TTS). Massively Multilingual Speech (MMS) is an extension of VITS for multilingual TTS that supports over 1100 languages.
Both use the same underlying VITS architecture, consisting of a discriminator and a generator for GAN-based training. They differ in their tokenizers: the VITS tokenizer transforms English input text into phonemes, while the MMS tokenizer transforms input text into character-based tokens.
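To make "character-based tokens" concrete, here is a toy sketch of character-level tokenization. It is not the VitsTokenizer implementation: the vocabulary construction and the helper names `build_vocab` and `toy_char_tokenize` are assumptions made purely for illustration.

```python
def build_vocab(corpus):
    """Assign one integer id per distinct character (toy vocabulary)."""
    chars = sorted(set("".join(corpus).lower()))
    return {ch: idx for idx, ch in enumerate(chars)}


def toy_char_tokenize(text, vocab):
    """Lowercase the text, drop characters missing from the vocabulary, return ids."""
    return [vocab[ch] for ch in text.lower() if ch in vocab]


vocab = build_vocab(["hola mundo"])
ids = toy_char_tokenize("Hola!", vocab)  # "!" is not in the toy vocabulary and is dropped
print(ids)
```

A phoneme-based tokenizer, as used by the English VITS checkpoints, would first convert the text into phonemes before mapping symbols to ids, which is why those checkpoints need the phonemizer package.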
You should finetune VITS-based checkpoints if you want a permissively licensed English TTS model, and MMS-based checkpoints in all other cases.
Coupled with the right data and the following training recipe, you can get an excellent finetuned version of any VITS/MMS checkpoint in 20 minutes with as few as 80 to 150 samples.
Finetuning VITS or MMS requires multiple stages to be completed in order:
- Install requirements
- Choose or create the initial model
- Finetune the model
- Optional - how to use the finetuned model
Try out these spaces:
- Explore English and Spanish finetuning on accents
- Explore finetuning on Spanish, English, Tamil, Gujarati, Marathi
Listen to snippets from before and after finetuning (video: MMS.finetuning.mp4).
The VITS checkpoints are released under the permissive MIT License. The MMS checkpoints, on the other hand, are licensed under CC BY-NC 4.0, a non-commercial license.
Note: Any finetuned models derived from these checkpoints will inherit the same licenses as their respective base models. Please ensure that you comply with the terms of the applicable license when using or distributing these models.
- Clone this repository and install common requirements.
git clone git@github.com:ylacombe/finetune-hf-vits.git
cd finetune-hf-vits
pip install -r requirements.txt
- Link your Hugging Face account so that you can pull/push model repositories on the Hub. This will allow you to save the finetuned weights on the Hub so that you can share them with the community and reuse them easily. Run the command:
git config --global credential.helper store
huggingface-cli login
And then enter an authentication token from https://huggingface.co/settings/tokens. Create a new token if you do not have one already. You should make sure that this token has "write" privileges.
- Build the monotonic alignment search function using Cython. This is absolutely necessary since the pure-Python version is awfully slow.
# Cython version of Monotonic Alignment Search
cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace
cd ..
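For intuition about what is being compiled, here is a minimal pure-Python sketch of monotonic alignment search as a dynamic program. It is not the repository's Cython code: the function name, the list-of-lists input layout (`log_probs[frame][token]`), and the tie-breaking rule are assumptions for illustration. The nested loops over frames and tokens, repeated for every sample, hint at why an interpreted version is slow.

```python
def monotonic_alignment_search(log_probs):
    """Return, for each frame, the index of the token it is aligned to.

    The alignment starts at token 0, ends at the last token, and at each
    frame either stays on the current token or advances by exactly one.
    """
    n_frames, n_tokens = len(log_probs), len(log_probs[0])
    if n_frames < n_tokens:
        raise ValueError("need at least as many frames as tokens")
    neg_inf = float("-inf")

    # best[t][j]: best cumulative score of a path ending at frame t on token j
    best = [[neg_inf] * n_tokens for _ in range(n_frames)]
    best[0][0] = log_probs[0][0]
    for t in range(1, n_frames):
        for j in range(min(t + 1, n_tokens)):
            stay = best[t - 1][j]
            advance = best[t - 1][j - 1] if j > 0 else neg_inf
            best[t][j] = max(stay, advance) + log_probs[t][j]

    # Backtrack from the last frame and last token
    alignment = [0] * n_frames
    j = n_tokens - 1
    for t in range(n_frames - 1, -1, -1):
        alignment[t] = j
        if t > 0 and j > 0 and best[t - 1][j - 1] >= best[t - 1][j]:
            j -= 1
    return alignment


scores = [[0, -9], [0, -9], [-9, 0], [-9, 0]]
print(monotonic_alignment_search(scores))  # → [0, 0, 1, 1]
```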
- (Optional) If you're using an original VITS checkpoint, as opposed to MMS checkpoints, install phonemizer.
Follow the steps indicated here.
For example, if you're on Debian/Ubuntu:
# Install dependencies
sudo apt-get install festival espeak-ng mbrola
# Install phonemizer
pip install phonemizer
- (Optional) With MMS checkpoints, some languages require installing uroman.
Some languages require running uroman on the text before feeding it to VitsTokenizer, since the tokenizer does not currently support performing this pre-processing itself.
To do this, clone the uroman repository to your local machine and set the bash variable UROMAN to the local path:
git clone https://github.com/isi-nlp/uroman.git
cd uroman
export UROMAN=$(pwd)
The rest is taken care of by the training script. Don't forget to adapt the inference snippet as indicated here.
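Before launching a long training run, it can help to check that UROMAN really points at a usable clone. The sketch below is a hypothetical helper (`resolve_uroman_script` is not part of the repository); it assumes the Perl script lives at bin/uroman.pl inside the clone, which is the path the inference snippet in this README uses.

```python
import os


def resolve_uroman_script(env=None):
    """Return the path to bin/uroman.pl under $UROMAN, or raise a helpful error."""
    env = os.environ if env is None else env
    root = env.get("UROMAN")
    if not root:
        raise RuntimeError(
            "UROMAN is not set; run `export UROMAN=$(pwd)` inside the uroman clone."
        )
    script = os.path.join(root, "bin", "uroman.pl")
    if not os.path.isfile(script):
        raise FileNotFoundError(f"Expected the uroman script at {script}")
    return script


# Usage (only meaningful once UROMAN is exported in your shell):
# print(resolve_uroman_script())
```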
There are two options:
Option 1: a training checkpoint is already available
Some checkpoints are already available, and chances are that the language that you want to train your model on already has a checkpoint.
Here is a non-exhaustive list of available checkpoints:
- English
  - ylacombe/vits-ljs-with-discriminator (make sure the phonemizer package is installed) - ideal for monolingual finetuning
  - ylacombe/vits-vctk-with-discriminator (make sure the phonemizer package is installed) - ideal for multispeaker English finetuning
  - ylacombe/mms-tts-eng-train - if you want to avoid the use of the phonemizer package
- Spanish - ylacombe/mms-tts-spa-train
- Korean - ylacombe/mms-tts-kor-train
- Marathi - ylacombe/mms-tts-mar-train
- Tamil - ylacombe/mms-tts-tam-train
- Gujarati - ylacombe/mms-tts-guj-train
If you found the right checkpoint, note the repository name and go directly to the next step 🤗.
Option 2: no training checkpoint is available for your language
Let's say that you have a text-to-speech dataset in Ghari, a Malayo-Polynesian language.
First, check whether your language is covered, that is, whether there is an MMS checkpoint trained on it, by searching for the language in the MMS Language Coverage Overview.
If it is, identify the ISO 639-3 language code, here gri.
Contrary to inference, finetuning requires the use of a discriminator that needs to be converted. So you first need to create a new checkpoint with this converted discriminator.
In the following steps, replace gri with the language code you identified and <local-folder> with where you want to save the model locally.
The model will also be pushed to your hub repository <your HF handle>/<repo-id-you-want>. Simply remove --push_to_hub <repo-id-you-want> if you don't want to push to the hub:
cd <path-to-finetune-hf-vits-repo>
python convert_original_discriminator_checkpoint.py --language_code gri --pytorch_dump_folder_path <local-folder> --push_to_hub <repo-id-you-want>
You can now use <repo-id-you-want> or <local-folder> as a starting point to finetune your model!
Note
You only need to do this step once per language.
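If you convert checkpoints for several languages, the command above can be assembled programmatically. This is a minimal sketch using only the flags shown in this README; the helper name `build_convert_command` and the example folder name are hypothetical.

```python
import shlex


def build_convert_command(language_code, local_folder, hub_repo_id=None):
    """Build the conversion command; omit --push_to_hub when no repo id is given."""
    cmd = [
        "python",
        "convert_original_discriminator_checkpoint.py",
        "--language_code", language_code,
        "--pytorch_dump_folder_path", local_folder,
    ]
    if hub_repo_id is not None:
        cmd += ["--push_to_hub", hub_repo_id]
    return cmd


# "./mms-tts-gri-train" is a placeholder folder name, not a repository convention.
cmd = build_convert_command("gri", "./mms-tts-gri-train")
print(shlex.join(cmd))
# To actually run it from the repository root:
# import subprocess; subprocess.run(cmd, check=True)
```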
There are two ways to run the finetuning script, both from the command line. Note that you only need one GPU to finetune VITS/MMS, as the models are really lightweight (83M parameters).
Preferred way: use a json config file
Note
Using a config file is the preferred way to use the finetuning script, as it includes the most important parameters to consider. For a full list of parameters, run python run_vits_finetuning.py --help. Note that some parameters are ignored by the training script.
The training_config_examples folder hosts examples of config files. Once satisfied with your config file, you can then finetune the model.
For example, finetune_english.json is a working example of finetuning on a Welsh female accent.
accelerate launch run_vits_finetuning.py ./training_config_examples/finetune_english.json
Other option: pass parameters directly to the command line.
For example:
accelerate launch run_vits_finetuning.py --model_name_or_path MODEL_NAME_OR_PATH --output_dir OUTPUT_DIR ...
Important parameters to consider:
- Everything related to artefacts: the project_name and the output directories (hub_model_id, output_dir) to keep track of the model.
- The model to finetune: model_name_or_path.
  - Here it should point to the training checkpoint of the previous section.
  - For example, if you choose an already existing checkpoint: ylacombe/vits-ljs-with-discriminator, or if you converted your own checkpoint: <repo-id-you-want> or <local-folder>.
- The dataset used, dataset_name, and its details: dataset_config_name, column names, etc.
  - If there are multiple speakers and you want to only keep one, pay attention to speaker_id_column_name, override_speaker_embeddings and filter_on_speaker_id. The latter lets you keep only one speaker, but you can also train on multiple speakers.
  - For example, the dataset used by default in finetune_english.json is a subset of the British Isles accents dataset, using a single Welsh female voice of the welsh_female configuration, identified by speaker_id=5223.
- The most important hyperparameters:
  - learning_rate
  - batch_size
  - the different loss weights: weight_duration, weight_kl, weight_mel, weight_disc, weight_gen, weight_fmaps
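As a rough illustration of how these parameters fit together, the sketch below assembles a config dictionary and serializes it to JSON. The key names are the ones listed above; every value is a placeholder rather than a recommendation, and the batch-size key is left out because its exact name should be checked with --help or the files in training_config_examples.

```python
import json

# All values below are illustrative placeholders, not tuned settings.
config = {
    "project_name": "my_vits_finetuning",          # hypothetical project name
    "hub_model_id": "<repo-id-for-finetuned-model>",
    "output_dir": "./finetuned_model",              # hypothetical local path
    "model_name_or_path": "ylacombe/vits-ljs-with-discriminator",
    "dataset_name": "<your-dataset>",
    "dataset_config_name": "welsh_female",
    "speaker_id_column_name": "speaker_id",         # assumed column name
    "override_speaker_embeddings": True,
    "filter_on_speaker_id": 5223,
    "learning_rate": 2e-5,
    "weight_duration": 1.0,
    "weight_kl": 1.0,
    "weight_mel": 1.0,
    "weight_disc": 1.0,
    "weight_gen": 1.0,
    "weight_fmaps": 1.0,
}

config_text = json.dumps(config, indent=4)
print(config_text)
# To save it: open("my_config.json", "w").write(config_text)
```

The resulting file can then be passed to the finetuning script in place of finetune_english.json.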
Note
The training_config_examples folder also contains two other examples, one to finetune a Gujarati checkpoint and another to finetune a Korean checkpoint. Those examples also show how to track experiments using wandb.
You can use a finetuned model via the Text-to-Speech (TTS) pipeline in just a few lines of code!
Just replace ylacombe/vits_ljs_welsh_female_monospeaker_2 with your own model id (hub_model_id) or path to the model (output_dir).
from transformers import pipeline
import scipy.io.wavfile  # import the submodule explicitly so scipy.io.wavfile is available
model_id = "ylacombe/vits_ljs_welsh_female_monospeaker_2"
synthesiser = pipeline("text-to-speech", model_id) # add device=0 if you want to use a GPU
speech = synthesiser("Hello, my dog is cooler than you!")
scipy.io.wavfile.write("finetuned_output.wav", rate=speech["sampling_rate"], data=speech["audio"][0])
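If you would rather not depend on scipy for saving audio, the standard library can write a 16-bit PCM file. This is a minimal sketch, assuming the waveform is a one-dimensional sequence of floats in [-1, 1] (worth verifying for your pipeline output); `float_to_pcm16` and `write_wav` are hypothetical helper names.

```python
import struct
import wave


def float_to_pcm16(samples):
    """Clip floats to [-1, 1] and scale them to signed 16-bit integers."""
    return [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]


def write_wav(path, samples, sampling_rate):
    """Write a mono 16-bit PCM WAV file using only the standard library."""
    pcm = float_to_pcm16(samples)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Usage with the pipeline output above (assumption: speech["audio"][0] is 1-D):
# write_wav("finetuned_output.wav", speech["audio"][0], speech["sampling_rate"])
```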
Note that if your model needed uroman during training, you should also apply the uroman package to your text inputs before passing them to the pipeline:
import os
import subprocess
from transformers import pipeline
import scipy.io.wavfile

model_id = "facebook/mms-tts-kor"
synthesiser = pipeline("text-to-speech", model_id) # add device=0 if you want to use a GPU

def uromanize(input_string, uroman_path):
    """Convert non-Roman strings to Roman using the `uroman` perl package."""
    script_path = os.path.join(uroman_path, "bin", "uroman.pl")
    command = ["perl", script_path]
    process = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # Execute the perl command
    stdout, stderr = process.communicate(input=input_string.encode())
    if process.returncode != 0:
        raise ValueError(f"Error {process.returncode}: {stderr.decode()}")
    # Return the output as a string and skip the new-line character at the end
    return stdout.decode()[:-1]

text = "이봐 무슨 일이야"
uromanized_text = uromanize(text, uroman_path=os.environ["UROMAN"])
speech = synthesiser(uromanized_text)
scipy.io.wavfile.write("finetuned_output.wav", rate=speech["sampling_rate"], data=speech["audio"][0])
- VITS was proposed in 2021, in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech by Jaehyeon Kim, Jungil Kong, Juhee Son. You can find the original codebase here.
- MMS was proposed in Scaling Speech Technology to 1,000+ Languages by Vineel Pratap, Andros Tjandra, Bowen Shi and co. You can find more details about the supported languages and their ISO 639-3 codes in the MMS Language Coverage Overview, and see all MMS-TTS checkpoints on the Hugging Face Hub: facebook/mms-tts.
- Hugging Face 🤗 Transformers for the model integration, Hugging Face 🤗 Accelerate for the distributed code and Hugging Face 🤗 datasets for facilitating datasets access.
- @nivibilla's adaptation of HifiGan's discriminator, used for English VITS training.