strange noises in your samples && error when running inference.py #30

Closed
MorganCZY opened this issue Nov 11, 2019 · 16 comments

@MorganCZY

Your samples at epoch 3200 have strange noises in unvoiced segments, while there is no such phenomenon in the samples at epoch 1600.
Besides, when running inference.py, an error occurs, pointing to this line:

mel = torch.cat((mel, zero), axis=2)
torch.cat() has a parameter "dim" rather than "axis"
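For reference, a sketch of the corrected call (assuming the intent is simply to concatenate along the last dimension, as the original line suggests):

```python
mel = torch.cat((mel, zero), dim=2)  # torch.cat takes `dim`, not `axis`
```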

@seungwonpark (Owner)

  • Fixed the latter issue, thank you!
  • Yes, I was aware of the noise issue. I also found that the results from a mel-spectrogram generated from zero-filled audio (which is a mel filled with -11.5129) are very noisy. I really need to solve this issue.
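For context, -11.5129 is just the log floor: a Tacotron2-style log-mel clamps magnitudes at 1e-5 before taking the natural log, and log(1e-5) ≈ -11.5129. A minimal sketch (the exact preprocessing in this repo may differ):

```python
import numpy as np

# Tacotron2-style dynamic range compression: log(clamp(x, min=clip_val)).
def dynamic_range_compression(x, clip_val=1e-5):
    return np.log(np.maximum(x, clip_val))

silence_mel = np.zeros((80, 100))              # mel magnitudes of zero-filled audio
print(dynamic_range_compression(silence_mel))  # every entry is log(1e-5) ≈ -11.5129
```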

@seungwonpark (Owner) commented Nov 11, 2019

[attachment: zeromel.zip]

The spectrogram of the zero-filled audio reconstruction looks like this: the line noise appears at every 4th frequency bin.

EDIT: there are 512 frequency bins in total, so the pattern appears every 4 bins, not every 8. The y-axis of the figure below is wrong.

[figure: zeromel]

@seungwonpark (Owner)

I hope to fix it by matching the implementation details with the official implementation. See #17.

@MorganCZY (Author)

I have trained and tested the official MelGAN repo. The synthesized samples have audible noise, and the overall quality is far worse than that of the official pretrained model.

@seungwonpark (Owner)

Oh, does that mean we need to use some tricks (that aren't shown in the paper) to properly train the model?

@MorganCZY (Author)

I strongly suspect there are some training tricks that are not shown in the official repo code. I left an issue on their repo, but haven't received a reply so far.

@bob80333

Checkerboard artifacts have been an issue with image GANs before; see this article: https://distill.pub/2016/deconv-checkerboard/

I think some of these audio artifacts may be related. The main ways to get rid of them were to replace strided conv layers with bilinear upsample/downsample + conv layers, or to ensure that kernel sizes were exact multiples of their strides. The discriminator here appears to use kernels of size 41 with a stride of 4; I wonder what would happen if we put a 4x bilinear downsample before those convs and set the stride to 1.
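A rough sketch of what that change could look like for a 1D discriminator block (channel counts, groups, and padding here are illustrative, not the repo's actual configuration):

```python
import torch
import torch.nn as nn

# Original-style block: strided conv (kernel size 41 is not an exact multiple of stride 4).
strided = nn.Conv1d(64, 256, kernel_size=41, stride=4, padding=20, groups=4)

# Proposed variant: fixed 4x downsample first, then the same conv with stride 1.
# AvgPool1d is used here as a simple 1D stand-in for a bilinear resize.
resample_then_conv = nn.Sequential(
    nn.AvgPool1d(kernel_size=4, stride=4),
    nn.Conv1d(64, 256, kernel_size=41, stride=1, padding=20, groups=4),
)

x = torch.randn(1, 64, 4096)  # (batch, channels, time)
print(strided(x).shape, resample_then_conv(x).shape)  # both: torch.Size([1, 256, 1024])
```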

I'm going to try this out myself, but first I'm waiting for a model I'm training on part of VoxCeleb2 (the full dataset doesn't fit on my SSD) to hit 1M training steps before trying it, to see if there's any improvement.

@seungwonpark (Owner)

Nice point, but isn't that a problem with the generator? The generator architecture doesn't seem to have that kind of problem; only the discriminator does.

@bob80333

At the end of that article, just before the conclusion, they found that discriminators with stride=2 in the first layer could also cause the generator to create checkerboard artifacts. The explanation was that some neurons in the generator receive many times more gradient than others due to the striding in the discriminator, and that helps create the artifacts.

I don't know if that would apply to this audio GAN, but it seems like a fairly simple thing to check. I have modified the discriminator in my fork, and I will start a training run tonight to see if it helps.

@bob80333

I tested my fork out: the discriminator converges really fast, and the generator learns nothing.

Note the scales here

What the generator's output looks like:

[screenshot: Screen Shot 2019-11-15 at 9 58 21 AM]

[screenshot: Screen Shot 2019-11-15 at 9 58 13 AM]

Swapping from strided convolutions to downsampling appears to have made the discriminator much stronger; I'm not sure how to fix that...

@seungwonpark (Owner)

I'm sorry to hear that.

Is using nn.Upsample for downsampling okay? The documentation says:

If you want downsampling/general resizing, you should use interpolate
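For reference, resizing with torch.nn.functional.interpolate directly would look roughly like this (a minimal sketch, not the actual code from the fork):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 4096)  # (batch, channels, time)

# 4x downsample via linear interpolation (the 1D analogue of bilinear resizing).
y = F.interpolate(x, scale_factor=0.25, mode='linear', align_corners=False)
print(y.shape)  # torch.Size([1, 64, 1024])
```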

Thanks for sharing your results, by the way.

@bob80333

Oh! Nice catch, I missed that in the docs. I just fixed it in my fork; training is slightly better with this, but the discriminator still overpowers the generator quickly.

Discriminator converged in 2k steps rather than <500 steps.

[screenshot: Screen Shot 2019-11-15 at 4 03 43 PM]

@geekboood commented Nov 17, 2019

@bob80333 Hi, I'm trying to train MelGAN on the CSMSC dataset, a single-speaker dataset of about 20 hours. My understanding is that the discriminator should converge pretty fast, because at the very beginning the generator's output is indeed very easy to discern, since it is still very bad. If you run for more epochs, you may find that the generator's results improve at some point. Here is my TensorBoard log:
[screenshot: SharedScreenshot]
As you can see in the figure, the generator's loss is stuck at around 120 before 300k iterations, and after that the loss starts improving. At the same time, the discriminator's loss fluctuates a lot. I can hear something after 1.1M steps, but it still has some artifacts. Maybe I should wait for 2M iterations.
Also, I found that at the end of each audio there is a peak that produces noise.

@bob80333

Hey, thanks for the information! I have trained on my dataset (part of VoxCeleb2) with the current master branch for 1M steps and got this training curve:

[screenshot: Screen Shot 2019-11-16 at 9 23 54 PM]

The results were understandable, but the voices themselves had artifacts while speaking, which is why I commented on this issue with ideas to fix it. For the first modification I tried, I waited 80k steps, at which point the discriminator had gotten to a loss of 3.3e-5 and the generator was producing loud, high-pitched noise. I tried other approaches, but the discriminator converged really quickly again, and I didn't want to wait to see whether it failed, especially since my original training curve was very different from that.

@seungwonpark (Owner)

I've trained the fix/17 branch for 14 days (more than 6400 epochs) on the LJSpeech-1.1 dataset, and the results don't have strange noise in unvoiced segments! I'll soon upload new audio samples (with a pre-trained model, if possible) and merge the fix/17 branch into master.

@seungwonpark (Owner)

The issues that were initially discussed here are now resolved, but I loved the ideas and countless trials from @bob80333 to improve the quality.
Feel free to continue the discussion here, or you may want to open a new issue.
