
Resume from the latest pickle #6

Closed
woctezuma wants to merge 8 commits from the google-colab branch

Conversation

woctezuma

Hello,

I know you don't accept pull requests. However:

  • this could be of interest to others who want to run the code on Google Colab,
  • this is the first place where they will look for such a change.

I have added the ability to resume from the latest .pkl file with the command-line argument --resume=latest.
The value of cur_nimg is inferred from the file name.
I have yet to figure out how to automatically compute the relevant value of aug.strength to resume from.
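Roughly, the resume logic boils down to something like the sketch below (a minimal sketch; the helper name and the results path are assumptions for illustration, not the exact code of this pull request):

    import glob
    import os
    import re

    def locate_latest_pickle(results_dir):
        # Snapshots are assumed to follow the default naming scheme:
        # <results_dir>/<run_dir>/network-snapshot-<kimg>.pkl
        pickles = sorted(glob.glob(os.path.join(results_dir, '*', 'network-snapshot-*.pkl')))
        if not pickles:
            return None, 0
        latest = pickles[-1]
        # The zero-padded counter in the file name is the kimg count at snapshot time.
        kimg = int(re.search(r'network-snapshot-(\d+)\.pkl$', latest).group(1))
        return latest, kimg * 1000  # cur_nimg is counted in images, not kimg

    resume_pkl, cur_nimg = locate_latest_pickle('/content/results')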

@woctezuma force-pushed the google-colab branch 2 times, most recently from 1433c88 to 74f839d on October 13, 2020 at 08:53
@YukiSakuma

I have yet to figure out how to automatically compute the relevant value of aug.strength to resume from.

I was actually about to ask about this, since on every resumed training run the augmentation strength always starts at 0.0.

@woctezuma

woctezuma commented Oct 13, 2020

I have added an option to manually set the initial augmentation strength, but I don't know whether it is the right way to tackle the issue. For long training sessions on Colab, it may not matter much, since the strength is supposed to increase quite fast.

Here is a quote from the article (section 3 on page 5):

[screenshot: excerpt from the article, Section 3, page 5]

For now, I will keep the initial strength at 0. In my latest run, a strength of 0.5 was reached after about an hour.

[screenshot: training log, augmentation strength ≈ 0.5]

It took 5 hours to reach a strength of 1!

[screenshot: training log, augmentation strength ≈ 1.0]

This happened after ~425k images, which is consistent with:

  • the article, which mentions 500k images,
  • the code:
    tune_kimg = 500, # Adjustment speed, measured in how many kimg it takes for the strength to increase/decrease by one unit.

It might still be worth setting the initial strength when resuming, but it would have to be done manually.
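
For intuition, here is a back-of-the-envelope sketch of that adjustment speed (not the actual training-loop code; the minibatch size is an assumption for this example):

    tune_kimg = 500        # kimg needed for the strength to move by one unit (see the code comment above)
    minibatch_size = 64    # assumption for this example

    # Change in augmentation strength per adjustment step, if the overfitting
    # heuristic asks for an increase at every step.
    step = minibatch_size / (tune_kimg * 1000)
    steps_to_ramp = int(round(1.0 / step))
    images_to_ramp = steps_to_ramp * minibatch_size    # = tune_kimg * 1000 = 500k images

    print(f'strength step per minibatch: {step:.6f}')
    print(f'images to ramp from 0 to 1 : {images_to_ramp}')

Under that assumption, the ramp from 0 to 1 takes at least tune_kimg = 500 kimg, which is in the same ballpark as the ~425-500 kimg mentioned above.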

@100330706

@woctezuma Did you solve the issue of the augmentation strength going above 1? (At tick 58 in the log you posted, it was at 1.005.)

@woctezuma

@woctezuma Did you solve the issue of the augmentation strength going above 1? (At tick 58 in the log you posted, it was at 1.005.)

Actually, I scrapped that training run because I had messed up the mapping network depth (--cfg_map): I was transfer-learning from the dog snapshot (which uses --cfg_map=8) while using the auto config (which uses --cfg_map=2).

Anyway, in my subsequent training runs, I fixed the mapping network depth and also deactivated the EMA ramp-up, and the augmentation strength has never gone above 1 (or even close to it). For most of the training, it stayed stable around 0.6, usually a bit above.

[screenshot: augmentation strength curves for each training run]

After 5000 kimg, when I stopped the training run to analyze the results, the augmentation strength was at 0.736.

[screenshot: augmentation strength after 5000 kimg]

I think the culprit was the EMA ramp-up, but I cannot say for sure, because I changed several settings at the same time and have not run many experiments.

@100330706

@woctezuma Thanks for your insight. I don't think the ramp-up can be the culprit, as I was using the following config with the ramp deactivated:

dict(ref_gpus=1, kimg=25000, mb=4, mbstd=4, fmaps=1, lrate=0.001, gamma=10, ema=10, ramp=None, map=8)

I've done a couple more experiments and realized that using "bgc" instead of "bg" for the augmentation pipeline slows down the increase of the augmentation strength a lot:

[screenshot: augmentation strength over time with the bgc pipeline]

[screenshot: augmentation strength over time with the bg pipeline]

However, in an old training run I also observed that the augmentation strength can still shoot well above 1 in the final stages of training, since overfitting is prone to occur at that point. So I think clipping the augmentation strength to some value below 1 (probably to the target value) could be beneficial. I think with FFHQ you don't run into these problems, but with more specific datasets you always run into many problems :S
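
For what it's worth, the clipping idea could look roughly like this (a minimal sketch, not the actual training-loop code; the target value of 0.6 and the function name are assumptions):

    AUG_TARGET = 0.6    # assumed target value for the overfitting heuristic

    def adjust_strength(strength, heuristic, nimg_delta, tune_kimg=500, clip_to_target=True):
        # Usual ADA-style adjustment: move the strength up or down by a fixed step.
        step = nimg_delta / (tune_kimg * 1000)
        strength += step if heuristic > AUG_TARGET else -step
        strength = max(strength, 0.0)             # the strength is never negative
        if clip_to_target:
            strength = min(strength, AUG_TARGET)  # proposed cap, instead of letting it drift past 1
        return strength

    print(adjust_strength(0.99, heuristic=0.9, nimg_delta=64))   # capped at 0.6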

@100330706

@woctezuma Just as a side note, do you monitor the losses with TensorBoard? I'm having a hard time using it with Colab :(
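
(For reference, TensorBoard can usually be embedded directly in a Colab notebook via the notebook extension; a minimal sketch, assuming the training run writes its event files under /content/results, which is an assumed path:)

    # In a Colab cell (notebook magics, not a plain .py script):
    %load_ext tensorboard
    %tensorboard --logdir /content/results    # assumed location of the training run directories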

@woctezuma

woctezuma commented Nov 7, 2020

I think with FFHQ you don't run into these problems, but with more specific datasets you always run into many problems :S

You could be right.
One thing I forgot to mention, and which might have helped a lot, is freezing the first layers of the discriminator.

Edit: I see #27 is about that issue.

@woctezuma Just as a side note, do you monitor the losses with TensorBoard? I'm having a hard time using it with Colab :(

I am not great at monitoring the losses in Colab. I deactivate the metrics during the training run and manually check them for the major snapshots, lately every 1000 kimg (although in hindsight I should have done it every 100 kimg).

@100330706

@woctezuma Yeah, I am currently freezing the first 3 layers. How many layers are you freezing? In this paper they found 4 to be the optimum for StyleGAN2: https://arxiv.org/pdf/2002.10964.pdf

What metric do you use? For me, fid50k_full takes too much time, so I modified it into a fid5k.
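
(Schematically, the change amounts to evaluating the metric on 5,000 generated images instead of 50,000; the field names below are made up for the sketch, not the repository's actual metric registry:)

    import copy

    # Hypothetical metric definitions, for illustration only.
    fid50k_full = dict(name='fid50k_full', num_fakes=50000, max_reals=None)

    fid5k = copy.deepcopy(fid50k_full)
    fid5k.update(name='fid5k', num_fakes=5000)   # 10x fewer generated images: much faster, but a noisier estimate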

@woctezuma

woctezuma commented Nov 7, 2020

@woctezuma Yeah, I am currently freezing the first 3 layers. How many layers are you freezing? In this paper they found 4 to be the optimum for StyleGAN2: https://arxiv.org/pdf/2002.10964.pdf

I freeze k=10 layers, because that is what worked best for StyleGAN2-ADA at 256x256 resolution in Nvidia's paper.
Figure 9 is at 256x256 resolution; figure 11b is at 32x32 resolution (CIFAR-10).

[screenshot: figure from the StyleGAN2-ADA paper]

I might be wrong though, because I have only skimmed through the paper.
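
In code terms, the Freeze-D idea is roughly the following (a toy sketch with made-up layer names, not the actual StyleGAN2-ADA code):

    K = 10   # number of highest-resolution discriminator layers to freeze (value discussed above)

    # Discriminator layers listed from the input resolution down to 4x4,
    # with two convolution layers per resolution block (a simplification).
    layers = [f'D/{res}x{res}/conv{i}' for res in (256, 128, 64, 32, 16, 8) for i in (0, 1)]
    layers += ['D/4x4/conv', 'D/4x4/dense', 'D/output']

    frozen = layers[:K]        # excluded from gradient updates / the optimizer's variable list
    trainable = layers[K:]
    print('frozen   :', frozen)
    print('trainable:', trainable)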

What metric do you use? For me, fid50k_full takes too much time, so I modified it into a fid5k.

None. I don't monitor any metrics during training; I only compute them manually afterwards, at predefined milestones.

8secz-johndpope pushed a commit to johndpope/stylegan2-ada that referenced this pull request on Dec 29, 2020
@woctezuma

Closing in favour of the PyTorch implementation here: NVlabs/stylegan2-ada-pytorch#3
