
Resume from the latest pickle #6

Closed
woctezuma wants to merge 8 commits from the google-colab branch

Conversation

woctezuma

Hello,

I know you don't accept pull requests. However:

  • this could be of interest to others who want to run the code on Google Colab,
  • this is the first place where they will look for such a change.

I have added the ability to resume from the latest .pkl file with the command-line argument --resume=latest.
The value of cur_nimg is inferred from the file name.
I have yet to figure out how to automatically compute the relevant value of aug.strength to resume from.
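Roughly, the resume logic boils down to something like the sketch below (a minimal sketch; the helper name and the results path are assumptions for illustration, not the exact code of this pull request):

    import glob
    import os
    import re

    def locate_latest_pickle(results_dir):
        # Snapshots are assumed to follow the default naming scheme:
        # <results_dir>/<run_dir>/network-snapshot-<kimg>.pkl
        pickles = sorted(glob.glob(os.path.join(results_dir, '*', 'network-snapshot-*.pkl')))
        if not pickles:
            return None, 0
        latest = pickles[-1]
        # The zero-padded counter in the file name is the kimg count at snapshot time.
        kimg = int(re.search(r'network-snapshot-(\d+)\.pkl$', latest).group(1))
        return latest, kimg * 1000  # cur_nimg is counted in images, not kimg

    resume_pkl, cur_nimg = locate_latest_pickle('/content/results')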

@woctezuma force-pushed the google-colab branch 2 times, most recently from 1433c88 to 74f839d on October 13, 2020 at 08:53
@YukiSakuma

I have yet to figure out how to automatically compute the relevant value of aug.strength to resume from.

I was actually about to ask about this, since on every resumed training run the augmentation strength always starts at 0.0.

@woctezuma

woctezuma commented Oct 13, 2020

I have added an option to manually set the initial augmentation strength, but I don't know whether it is the right way to tackle the issue. For long training sessions on Colab, it may not matter much, since the strength is supposed to increase quite fast.

Here is a quote from the article (section 3 on page 5):

[screenshot: excerpt from the article, Section 3, page 5]

For now, I will keep the initial strength at 0. In my latest run, a strength of 0.5 was reached after about an hour.

[screenshot: training log, augmentation strength ≈ 0.5]

It took 5 hours to reach a strength of 1!

[screenshot: training log, augmentation strength ≈ 1.0]

This happened after ~425k images, which is consistent with:

  • the article, which mentions 500k images,
  • the code:
    tune_kimg = 500, # Adjustment speed, measured in how many kimg it takes for the strength to increase/decrease by one unit.

It might still be worth setting the initial strength when resuming, but it would have to be done manually.
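
For intuition, here is a back-of-the-envelope sketch of that adjustment speed (not the actual training-loop code; the minibatch size is an assumption for this example):

    tune_kimg = 500        # kimg needed for the strength to move by one unit (see the code comment above)
    minibatch_size = 64    # assumption for this example

    # Change in augmentation strength per adjustment step, if the overfitting
    # heuristic asks for an increase at every step.
    step = minibatch_size / (tune_kimg * 1000)
    steps_to_ramp = int(round(1.0 / step))
    images_to_ramp = steps_to_ramp * minibatch_size    # = tune_kimg * 1000 = 500k images

    print(f'strength step per minibatch: {step:.6f}')
    print(f'images to ramp from 0 to 1 : {images_to_ramp}')

Under that assumption, the ramp from 0 to 1 takes at least tune_kimg = 500 kimg, which is in the same ballpark as the ~425-500 kimg mentioned above.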

@100330706

@woctezuma Did you solve the issue of the augmentation strength going above 1? (At tick 58 in the log you posted, it was at 1.005.)

@woctezuma

@woctezuma Did you solve the issue of the augmentation strength going above 1? (At tick 58 in the log you posted, it was at 1.005.)

Actually, I scrapped that training run because I had messed up the mapping network depth (--cfg_map): I was transfer-learning from the dog snapshot (which uses --cfg_map=8) while using the auto config (which uses --cfg_map=2).

Anyway, in my subsequent training runs, I fixed the mapping network depth and also deactivated the EMA ramp-up, and the augmentation strength has never gone above 1 (or even close to it). For most of the training, it stayed stable around 0.6, usually a bit above.

[screenshot: augmentation strength curves for each training run]

After 5000 kimg, when I stopped the training run to analyze the results, the augmentation strength was at 0.736.

[screenshot: augmentation strength after 5000 kimg]

I think the culprit was the EMA ramp-up, but I cannot say for sure, because I changed several settings at the same time and have not run many experiments.

@100330706

@woctezuma Thanks for your insight. I don't think the ramp-up can be the culprit, as I was using the following config with the ramp deactivated:

dict(ref_gpus=1, kimg=25000, mb=4, mbstd=4, fmaps=1, lrate=0.001, gamma=10, ema=10, ramp=None, map=8)

I've done a couple more experiments and realized that using "bgc" instead of "bg" for the augmentation pipeline slows down the increase of the augmentation strength a lot:

[screenshot: augmentation strength over time with the bgc pipeline]

[screenshot: augmentation strength over time with the bg pipeline]

However, in an old training run I also observed that the augmentation strength can still shoot well above 1 in the final stages of training, since overfitting is prone to occur at that point. So I think clipping the augmentation strength to some value below 1 (probably to the target value) could be beneficial. I think with FFHQ you don't run into these problems, but with more specific datasets you always run into many problems :S
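
For what it's worth, the clipping idea could look roughly like this (a minimal sketch, not the actual training-loop code; the target value of 0.6 and the function name are assumptions):

    AUG_TARGET = 0.6    # assumed target value for the overfitting heuristic

    def adjust_strength(strength, heuristic, nimg_delta, tune_kimg=500, clip_to_target=True):
        # Usual ADA-style adjustment: move the strength up or down by a fixed step.
        step = nimg_delta / (tune_kimg * 1000)
        strength += step if heuristic > AUG_TARGET else -step
        strength = max(strength, 0.0)             # the strength is never negative
        if clip_to_target:
            strength = min(strength, AUG_TARGET)  # proposed cap, instead of letting it drift past 1
        return strength

    print(adjust_strength(0.99, heuristic=0.9, nimg_delta=64))   # capped at 0.6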

@100330706

@woctezuma Just as a side note, do you monitor the losses with TensorBoard? I'm having a hard time using it with Colab :(
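
(For reference, TensorBoard can usually be embedded directly in a Colab notebook via the notebook extension; a minimal sketch, assuming the training run writes its event files under /content/results, which is an assumed path:)

    # In a Colab cell (notebook magics, not a plain .py script):
    %load_ext tensorboard
    %tensorboard --logdir /content/results    # assumed location of the training run directories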

@woctezuma

woctezuma commented Nov 7, 2020

I think with FFHQ you don't run into these problems, but with more specific datasets you always run into many problems :S

You could be right.
One thing I forgot to mention, and which might have helped a lot, is freezing the first layers of the discriminator.

Edit: I see #27 is about that issue.

@woctezuma Just as a side note, do you monitor the losses with TensorBoard? I'm having a hard time using it with Colab :(

I am not great at monitoring the losses in Colab. I deactivate the metrics during the training run and manually check them for the major snapshots, lately every 1000 kimg (although in hindsight I should have done it every 100 kimg).

@100330706

@woctezuma Yeah, I am currently freezing the first 3 layers. How many layers are you freezing? In this paper they found 4 to be the optimum for StyleGAN2: https://arxiv.org/pdf/2002.10964.pdf

What metric do you use? For me, fid50k_full takes too much time, so I modified it into a fid5k.
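
(Schematically, the change amounts to evaluating the metric on 5,000 generated images instead of 50,000; the field names below are made up for the sketch, not the repository's actual metric registry:)

    import copy

    # Hypothetical metric definitions, for illustration only.
    fid50k_full = dict(name='fid50k_full', num_fakes=50000, max_reals=None)

    fid5k = copy.deepcopy(fid50k_full)
    fid5k.update(name='fid5k', num_fakes=5000)   # 10x fewer generated images: much faster, but a noisier estimate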

@woctezuma

woctezuma commented Nov 7, 2020

@woctezuma Yeah, I am currently freezing the first 3 layers. How many layers are you freezing? In this paper they found 4 to be the optimum for StyleGAN2: https://arxiv.org/pdf/2002.10964.pdf

I freeze k=10 layers, because that is what worked best for StyleGAN2-ADA at 256x256 resolution in Nvidia's paper.
Figure 9 is at 256x256 resolution; figure 11b is at 32x32 resolution (CIFAR-10).

[screenshot: figure from the StyleGAN2-ADA paper]

I might be wrong though, because I have only skimmed through the paper.
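
In code terms, the Freeze-D idea is roughly the following (a toy sketch with made-up layer names, not the actual StyleGAN2-ADA code):

    K = 10   # number of highest-resolution discriminator layers to freeze (value discussed above)

    # Discriminator layers listed from the input resolution down to 4x4,
    # with two convolution layers per resolution block (a simplification).
    layers = [f'D/{res}x{res}/conv{i}' for res in (256, 128, 64, 32, 16, 8) for i in (0, 1)]
    layers += ['D/4x4/conv', 'D/4x4/dense', 'D/output']

    frozen = layers[:K]        # excluded from gradient updates / the optimizer's variable list
    trainable = layers[K:]
    print('frozen   :', frozen)
    print('trainable:', trainable)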

What metric do you use? For me, fid50k_full takes too much time, so I modified it into a fid5k.

None. I don't monitor any metrics during training; I only compute them manually afterwards, at predefined milestones.

8secz-johndpope pushed a commit to johndpope/stylegan2-ada that referenced this pull request on Dec 29, 2020
@woctezuma

Closing in favour of the PyTorch implementation here: NVlabs/stylegan2-ada-pytorch#3
