Issues with Implementing a new model - Capacitron #455
Replies: 5 comments 9 replies
-
Did you already try different datasets in order to rule out that your problem is caused by this specific data? And what are the results with the same dataset trained with "classic" Taco2?
-
:: UPDATE :: GMM (Graves) attention was not broken in the code. I had some config problems, and I realised I had left the `"` characters in the dataset; after fixing those, the Graves attention plots look like this: Tensorboard for Capacitron with GMM attention. The alignments are still a bit wonky, but synthesis got a lot better than before. The problem of too little prosody is still there: sampling from the prior does produce good speech, but inferring the latent embedding from multiple references seems to produce very similar, fast and monotonic prosody. Samples.
-
::UPDATE:: I've trained the model with LJSpeech. References used for the posterior samples. Alignments: It's hard for me to tell whether the model learnt any prosody - I think yes, because when I synthesise the same sentence and sample from the prior, it outputs quite varied prosody - samples. The alignments are just as wonky as before, unfortunately. I've investigated the reference encoder architecture I built and realised that I need to fix the masking I implemented, because the convolution padding is different from the TensorFlow implementation. I'll report back later with results from the new padding/masking.
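For anyone tracking the same masking bug, here's a minimal sketch of how valid sequence lengths propagate under TF-style "same" padding, which is what the mask has to follow (stride-2 convolutions are just an illustrative assumption, not my exact layer config):

```python
import torch

def conv_same_out_len(lengths: torch.Tensor, stride: int) -> torch.Tensor:
    # TF "same" padding: out_len = ceil(in_len / stride), independent of kernel size.
    # PyTorch with padding=k//2 (odd k) gives the same result.
    return torch.div(lengths + stride - 1, stride, rounding_mode="floor")

def build_time_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    # Boolean mask of shape (batch, max_len); True marks valid frames.
    steps = torch.arange(max_len, device=lengths.device)
    return steps.unsqueeze(0) < lengths.unsqueeze(1)

# Example: track valid lengths through a 6-layer stride-2 conv stack
lengths = torch.tensor([153, 97, 200])  # mel frame counts per batch item
for _ in range(6):
    lengths = conv_same_out_len(lengths, stride=2)
mask = build_time_mask(lengths, max_len=int(lengths.max()))
```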
-
Hey, this is very cool. I'm working on something fairly similar but a bit less complicated - a simple Tacotron-VAE model. I'm basing my work on this repo: https://github.com/jinhan/tacotron2-vae (it uses the NVIDIA Tacotron 2 implementation). It has basically the same reference encoder and z sampling procedure. I immediately found a huge improvement in swapping text for phonemes, both in terms of learning attention and overall validation loss. I just created the phoneme sequences myself using the CMU dictionary. One major issue I ran into with the NVIDIA Tacotron 2 implementation was the dropout in the pre-net creating different prosody each time at inference, making it hard to control prosody using the VAE. I copied the solution mentioned in the Mozilla version of Tacotron (mozilla/TTS#50 (comment)) and am trying it out now. So far batch norm rather than dropout does indeed seem magical (see comment above), but I am having difficulty learning attention, so I'm trying the suggested method of using dropout until the model learns attention and then swapping to batch norm. Dropout in the pre-net to create variation does seem to be the original Tacotron 2 authors' intention ("In order to introduce output variation at inference time, dropout with probability 0.5 is applied only to layers in the pre-net of the autoregressive decoder") and is also used in the Capacitron model according to their Appendix. However, I just found it complicated things at inference time when I was trying to isolate the effect of the VAE. The sketch below shows the two pre-net variants I mean. Do keep us posted with your progress!
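To make the trade-off concrete, here is a minimal sketch of the two pre-net variants being discussed (layer sizes are illustrative, not taken from either repo):

```python
import torch
from torch import nn
import torch.nn.functional as F

class PreNet(nn.Module):
    """Tacotron 2 style pre-net; use_batch_norm=True swaps the always-on
    dropout for batch norm, as suggested in mozilla/TTS#50."""
    def __init__(self, in_dim=80, hidden=256, use_batch_norm=False, p=0.5):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(in_dim, hidden), nn.Linear(hidden, hidden)]
        )
        self.use_batch_norm = use_batch_norm
        if use_batch_norm:
            self.norms = nn.ModuleList([nn.BatchNorm1d(hidden) for _ in self.layers])
        self.p = p

    def forward(self, x):
        for i, linear in enumerate(self.layers):
            x = torch.relu(linear(x))
            if self.use_batch_norm:
                x = self.norms[i](x)
            else:
                # Applied in eval mode too: this is the deliberate
                # "output variation at inference" trick from the Tacotron 2 paper.
                x = F.dropout(x, p=self.p, training=True)
        return x
```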
-
Hello Community!
(This will be a long post, thanks for reading.)
General Intro/Problems
As part of my master's thesis, I'm implementing a new model in Coqui TTS: Capacitron, from Google. It's a variational encoder extension of Tacotron 1, using a Reference Encoder architecture similar to GST Tacotron's in order to model prosody. I've implemented all of the paper's extensions in the Coqui TTS architecture and started experimenting. Unfortunately, the model is still buggy and doesn't output the desired speech. I'm posting here because I have a couple of specific issues I'd like help with. If you're interested, you can find the code here.
Capacitron Summarised
In the following, I'll explain the main ideas behind the Capacitron method - the reader will need them to understand the stat plots I'll be posting.
Instead of the reference encoder outputting a fixed embedding for modelling prosody, the variational extension outputs the parameters of the variational posterior distribution, with which we aim to model the posterior latent prosody space. This distribution is a simple diagonal Gaussian whose mean and variance parameters are learned during training. During training, we therefore sample the latent prosody embedding from the variational posterior: z ~ q(z|x).
In order to model the prior distribution of the latent space, we define a simple non-learnable diagonal Gaussian p(z). During inference, if we don't supply a reference, the reference encoder is not used and the prosody embedding is sampled from the prior: z ~ p(z). During training, we never sample from the prior; we only need the distribution object itself in the loss function. This is the generative aspect of the model: the T1 decoder learns to decode values sampled from this distribution, so that sampling anything from the prior should give us realistic prosody.
This embedding z is concatenated to the output of the standard T1 encoder the same way as the speaker embedding / GST embedding would be.
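As a minimal sketch of this sampling logic (names and dimensions are illustrative, not the actual code in my branch):

```python
import torch
from torch import nn
import torch.nn.functional as F
from torch.distributions import Normal

class VariationalPosterior(nn.Module):
    """Maps the reference-encoder summary to q(z|x), a diagonal Gaussian."""
    def __init__(self, ref_dim=128, latent_dim=64):
        super().__init__()
        self.to_params = nn.Linear(ref_dim, 2 * latent_dim)
        # Fixed, non-learnable prior p(z) = N(0, I)
        self.register_buffer("prior_loc", torch.zeros(latent_dim))
        self.register_buffer("prior_scale", torch.ones(latent_dim))

    def forward(self, ref_summary=None):
        prior = Normal(self.prior_loc, self.prior_scale)
        if ref_summary is None:
            # Inference without a reference: sample z ~ p(z)
            return prior.sample(), None
        mu, raw_scale = self.to_params(ref_summary).chunk(2, dim=-1)
        posterior = Normal(mu, F.softplus(raw_scale))
        # rsample() keeps the sample differentiable (reparameterisation trick)
        z = posterior.rsample()
        return z, posterior

# z is then broadcast over time and concatenated to the text-encoder
# outputs, exactly like a speaker/GST embedding would be.
```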
The loss function for the Capacitron model starts from the standard L1 loss of Tacotron 1. Equation 9 from the paper defines the variational extension of this loss.
The term on the LHS is the expected negative log likelihood of generating the spectrogram x given the latent embedding z, the text y_t and the (conditional) speaker embedding y_s. The negative log likelihood is a stand-in for the basic Tacotron mel-decoder L1 loss.
The term on the RHS is the variational and capacity extension of the model. \beta is an automatically tuned Lagrange multiplier weighting the KL term. The KL divergence between the variational posterior (parameterized by the reference encoder) and the prior is an upper bound on the mutual information between the data space and the prosody space, i.e. the capacity of the latent space. C is a scalar set per training run: a capacity limit that caps this upper bound at the level we want.
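Written out, the objective described above is (my notation, which should match Equation 9 up to symbols):

$$
\mathcal{L}(\theta,\beta)=\mathbb{E}_{z\sim q_\theta(z\mid x)}\!\left[-\log p_\theta(x\mid z, y_t, y_s)\right]+\beta\left(D_{\mathrm{KL}}\!\left(q_\theta(z\mid x)\,\Vert\,p(z)\right)-C\right)
$$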
The loss function involves two processes. First, it is minimized with respect to the model parameters \theta - the usual minimization from T1 using the ADAM optimizer. Second, it is maximized with respect to the single parameter \beta. More about the intuition for this whole process can be found in this pastebin. \beta is optimized using a separate SGD optimizer; as is standard for maximization, we turn it into a minimization by negating the \beta term, which is better suited to SGD.
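A minimal sketch of this dual optimisation (the softplus parameterisation and the exact detach placement are my reading of the setup, not necessarily the paper's exact scheme):

```python
import torch
from torch import nn
import torch.nn.functional as F

model = nn.Linear(80, 80)  # stand-in for the full Tacotron model
beta_raw = nn.Parameter(torch.tensor(1.0))  # unconstrained; softplus keeps beta > 0

model_optim = torch.optim.Adam(model.parameters(), lr=1e-3)  # everything EXCEPT beta
beta_optim = torch.optim.SGD([beta_raw], lr=1e-4)

def capacitron_losses(rec_loss, kl, capacity):
    beta = F.softplus(beta_raw)
    # theta minimises rec + beta * (KL - C); beta is held constant here
    model_loss = rec_loss + beta.detach() * (kl - capacity)
    # beta maximises beta * (KL - C); negating it lets SGD's minimisation do the job.
    # (kl - capacity) is detached so this term only trains beta itself.
    beta_loss = -beta * (kl - capacity).detach()
    return model_loss, beta_loss
```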
Experiments
The only part of the original paper my implementation does not follow is the attention. The authors use GMM (aka Graves) attention, but the Graves attention in Coqui TTS is buggy. For this reason, I started training with the original Tacotron attention and ran experiments with DCA. DCA was also somewhat buggy in my experiments, because some gradients either exploded or returned NaNs and I had to restart training from checkpoints; still, DCA training yielded better results than the original attention.
Tensorboard model with original attention.
Tensorboard model with DCA.
Audio Files from both experiments. Files with "Prior" in their names were inferred using the prior distribution (generative inference without a reference wav); the ones using the approximate posterior distribution were inferred from reference wavs in the folder you can also find through the link.
Issues I'm facing
1. Double Optimization Problem in Coqui TTS
In order to get this double optimization process working in the code, I've split the parameters passed to the optimizers in `train_tacotron.py` and initialized the new SGD optimizer. The expected behaviour of \beta is to converge to zero (see the pastebin intuition above for why), independently of the main ADAM optimizer. In accordance with the paper, I have also set a gradual schedule on the learning rate passed to the ADAM optimizer, which is responsible for minimizing ALL weights EXCEPT \beta. Beta converges nicely towards zero in the DCA training; however, in the other experiment there are some firm downward jumps in beta that correspond to learning-rate changes on the ADAM optimizer. As mentioned above, \beta is not passed to the ADAM optimizer; it's only minimized by the SGD optimizer. This makes me believe that I didn't split the two optimizers correctly - or maybe I did and this is just some strange coincidence in the model?
::QUESTION::
Is the way I'm splitting the optimizers/training steps correct? @mueller91 (thank you again!) suggested the following order for the training step:
1. ADAM optimizer zero_grad()
2. SGD optimizer zero_grad()
3. loss_dict['only beta loss'].backward()
4. SGD optimizer step()
5. SGD optimizer zero_grad()
6. ADAM optimizer zero_grad()
7. loss_dict['everything else loss'].backward()
8. ADAM weight decay and gradient clipping
9. ADAM optimizer step()
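In code, my understanding of that suggestion looks roughly like this (loss names and the `compute_losses` helper are placeholders; `retain_graph=True` is needed because the two losses can share parts of the graph):

```python
import torch

def train_step(batch, model, adam_optim, sgd_optim, grad_clip=1.0):
    outputs = model(batch)
    loss_dict = model.compute_losses(outputs, batch)  # hypothetical helper

    adam_optim.zero_grad()
    sgd_optim.zero_grad()

    # Step the Lagrange multiplier first. retain_graph=True because the
    # KL term may also appear in the main loss below.
    loss_dict["beta_loss"].backward(retain_graph=True)
    sgd_optim.step()

    # Clear everything again: beta_loss.backward() may also have deposited
    # gradients on the model parameters, which ADAM must not use.
    sgd_optim.zero_grad()
    adam_optim.zero_grad()

    loss_dict["model_loss"].backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    adam_optim.step()
```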
2. Broken alignments
The alignments show some strange behaviour:
DCA:
Original Attention:
::QUESTION::
I don't have an intuition for where such random alignment/attention can come from. The model I'm building uses the same standard T1 encoder, and the decoder is even fed MORE INFORMATION (the output of the reference encoder is concatenated to the output of the CBHG text encoder), yet it still struggles to align well at test time. Could someone shed light on a possible reason for such random behaviour? I'm planning to try to fix the Graves attention in the code this week - I've been too busy debugging my own code first - but I do see now that the attention strongly affects the prosody/speech/general performance of this specific model.
3. Prosody
The alignments seem to be randomly broken at test time in both cases, though DCA shows much better alignments and produces better speech than the original attention. HOWEVER, in both cases not a lot of prosody is actually learned. Some prosody is being learned, because sampling from the prior produces understandable speech, sometimes with even more prosody than the usual T1 model's even, monotonic voice. But when using a reference wav to infer the prosody, the samples all seem to sound the same (listen above through the audio link). The reference encoder architecture I'm using is very similar to the GST reference encoder, but it includes some major changes: variable-length convolutional input masking, a text summary network to feed information about the text to the approximate posterior, LSTM recurrence, and an MLP output layer producing the parameters of the approximate posterior, modeled as a diagonal Gaussian. A rough sketch of this architecture follows below.
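For context, here is a rough, simplified sketch of that reference encoder (single conv layer, no masking shown; dimensions and layer counts are illustrative, the real code is in my branch linked above):

```python
import torch
from torch import nn

class CapacitronRefEncoder(nn.Module):
    """GST-style reference encoder + text summary, producing the
    parameters of a diagonal-Gaussian posterior q(z|x)."""
    def __init__(self, n_mels=80, conv_ch=32, rnn_dim=128,
                 text_dim=256, latent_dim=64):
        super().__init__()
        self.conv = nn.Conv2d(1, conv_ch, kernel_size=3, stride=2, padding=1)
        self.rnn = nn.LSTM(conv_ch * ((n_mels + 1) // 2), rnn_dim,
                           batch_first=True)
        # Text summary network conditions the posterior on the text
        self.text_summary = nn.Linear(text_dim, rnn_dim)
        self.mlp = nn.Linear(2 * rnn_dim, 2 * latent_dim)

    def forward(self, mels, text_summary_in):
        # mels: (B, T, n_mels) -> (B, 1, T, n_mels)
        x = torch.relu(self.conv(mels.unsqueeze(1)))
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        _, (h, _) = self.rnn(x)              # last hidden state summarises audio
        txt = torch.tanh(self.text_summary(text_summary_in))
        params = self.mlp(torch.cat([h[-1], txt], dim=-1))
        mu, log_sigma = params.chunk(2, dim=-1)
        return mu, log_sigma                 # parameters of q(z|x)
```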
::QUESTION::
I've gone through these lines of code many times, but I'd really appreciate a fresh set of eyes taking a look for a simple sanity check.
If you've come this far, thank you very much for reading and for the help! :)