Training instability with motorcycle #138

Open
pallgeuer opened this issue Aug 3, 2023 · 3 comments

Comments

@pallgeuer

Hi, I am trying to use this code to train on the motorcycle data, but the training is proving unstable. I have done the Blender renders as described and have all 337 models with 96 renders per model. I train as follows:

python train_3d.py --outdir=OUTPATH --data=RENDER/img/03790512 --camera_path=RENDER/camera --gpus=2 --batch=32 --batch-gpu=16 --mbstd-group=4 --gamma=80 --data_camera_mode=shapenet_motorbike --img_res=1024 --dmtet_scale=1.0 --use_shapenet_split=1 --one_3d_generator=1 --fp32=0 --workers=4

This should be essentially the documented default training parameters, except that I'm running on 2xA100 instead of 8xA100.

My issue is that the FID50k only decreases from ~250 initially to ~85 (higher than the 50-65 expected from the paper), and at around 2000-3000kimg (out of the planned 20000kimg) the training diverges and never recovers. What parameters should I use so that your code on your data can actually finish training?

I would also be interested to know the differences between the code and training commands provided in this GitHub repo and those used to train the pretrained motorcycle model. For one, volume subdivision isn't implemented, but what else differs (e.g. R1 regularization, SDF regularization, single vs. two discriminators)? The paper also says Adam beta = 0.9, but the code uses (0, 0.99) (!), which is puzzling.

@SteveJunGao
Collaborator

Hi @pallgeuer, thanks for the great questions!

I think the divergence might be because the GAN is unstable to train. There are several options you can try to increase stability:

  • Increase the batch size.
  • Turn off fp16 training with --fp32=1 (we recommend --fp32=0 only when you need the speed).
  • The most effective regularization is the R1 regularization from StyleGAN; you can increase --gamma when you see instability (e.g. make it 2x larger). A rough sketch of this penalty follows the list.
  • We also observe that the model can sometimes diverge during training; restarting the training often makes the divergence disappear.
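For reference, here is a minimal sketch of the R1 penalty as used in StyleGAN2-style training; the names below are placeholders for illustration, not the exact code in this repo, and --gamma is the weight on this term:

```python
import torch
import torch.nn as nn

def r1_penalty(discriminator: nn.Module, real_images: torch.Tensor, gamma: float) -> torch.Tensor:
    """Minimal R1 sketch (StyleGAN2-style): penalize the squared gradient norm
    of the discriminator logits w.r.t. real images; gamma corresponds to --gamma."""
    real_images = real_images.detach().requires_grad_(True)
    logits = discriminator(real_images)
    grads = torch.autograd.grad(logits.sum(), real_images, create_graph=True)[0]
    return (gamma / 2) * grads.square().sum(dim=[1, 2, 3]).mean()

# Illustrative usage with a placeholder discriminator.
disc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))
penalty = r1_penalty(disc, torch.randn(4, 3, 64, 64), gamma=80.0)
```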

Re volume subdivision: unfortunately, we didn't include the volume subdivision code in this codebase. Our paper has ablation studies on removing volume subdivision (Table 2), and the released pre-trained model is trained without it.

Re R1 regularization: the gamma we used is 80 for motorbikes.

Re SDF regularization: which regularization do you mean exactly? We do have one regularization in the paper (Eq. 2), and its hyperparameter is fixed in the code for all the experiments.

Re single vs. two discriminators: we always use two discriminators in all experiments (except in the ablation studies comparing two discriminators vs. a single discriminator).

Re Adam betas: we apologize for the typo in the paper; the (0, 0.99) in the code is correct.
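Concretely, this is what the optimizer setup looks like in StyleGAN2-style code (a hedged sketch with placeholder modules and learning rates, not the exact values used in this repo):

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the real generator/discriminator.
generator = nn.Linear(512, 512)
discriminator = nn.Linear(512, 1)

# Illustrative only: Adam with betas=(0, 0.99) as in the code;
# the learning rates below are placeholders, not this repo's defaults.
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-3, betas=(0.0, 0.99), eps=1e-8)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-3, betas=(0.0, 0.99), eps=1e-8)
```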

@pallgeuer
Author

Hi, many thanks for the detailed answer.

My original diverging training runs used a batch size of 64, which is close to the most I can fit on 2xA100 (96 fits, but starts showing symptoms of hitting the GPU memory limit). Is gradient accumulation by any chance implemented to allow larger batch sizes? Roughly what I mean is sketched below.
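```python
import torch
import torch.nn as nn

# Minimal gradient accumulation sketch; the model and data are placeholders,
# not anything from train_3d.py.
model = nn.Linear(512, 1)
opt = torch.optim.Adam(model.parameters(), lr=2e-3, betas=(0.0, 0.99))
accum_steps = 4  # effective batch = accum_steps * per-step (per-GPU) batch

opt.zero_grad(set_to_none=True)
for step in range(16):
    x = torch.randn(16, 512)              # one micro-batch
    loss = model(x).mean() / accum_steps  # scale so accumulated gradients average correctly
    loss.backward()
    if (step + 1) % accum_steps == 0:
        opt.step()                        # update only every accum_steps micro-batches
        opt.zero_grad(set_to_none=True)
```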

Was the pretrained model trained with --fp32=1?

Weirdly, a training run that set gamma=40 instead of gamma=80 is the first run that has made it to 6000kimg and is currently still converging.

I got the name "SDF regularization" from the paper:

We follow StyleGAN2 [35] and use lazy regularization, which applies R1 regularization to discriminators only every 16 training steps. Finally, we set the hyperparameter µ that controls the SDF regularization to 0.01 in all the experiments.

But yes, this is exactly the loss contribution described in Eqs. 2 & 3, as you said.
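As an aside, my reading of the lazy R1 scheduling quoted above reduces to something like this (a rough sketch with placeholder modules and data, not this repo's code):

```python
import torch
import torch.nn as nn

# Rough sketch of lazy regularization: the R1 term is added only every
# `r1_interval` steps and scaled up to keep its average strength unchanged.
# The discriminator and data below are placeholders.
disc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))
opt = torch.optim.Adam(disc.parameters(), lr=2e-3, betas=(0.0, 0.99))
r1_interval, gamma = 16, 80.0

for step in range(64):
    real = torch.randn(4, 3, 64, 64, requires_grad=True)  # stand-in for real images
    logits = disc(real)
    loss = nn.functional.softplus(-logits).mean()          # discriminator loss on reals
    if step % r1_interval == 0:                            # lazy: only every 16 steps
        grads = torch.autograd.grad(logits.sum(), real, create_graph=True)[0]
        r1 = (gamma / 2) * grads.square().sum(dim=[1, 2, 3]).mean()
        loss = loss + r1 * r1_interval                     # compensate for the skipped steps
    opt.zero_grad(set_to_none=True)
    loss.backward()
    opt.step()
```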

Okay, so this GitHub repo uses two discriminators by default when called with parameters like the ones I specified?

Was the choice of Adam betas just inherited from another project, or did initial tests with beta1 >= 0.5 show that it had a detrimental effect? Was training with it unstable? Has a learning rate scheduler that reduces the learning rate over time been tested?

@jingyang2017

Hi, I am running into a similar issue when trying to use this code to train on the chair category:
python train_3d.py --outdir='./results/' --data='/home/XXX/projects/XXX/Datasets/GET3D/ShapeNet/img/03001627' --camera_path /home/XXX/projects/XXX/Datasets/GET3D/ShapeNet/camera/ --gpus=8 --batch=32 --gamma=400 --data_camera_mode shapenet_chair --dmtet_scale 0.8 --use_shapenet_split 1 --one_3d_generator 1 --fp32 0

The following are the FID scores during training:
{"results": {"fid50k": 243.90716463299503}, "metric": "fid50k", "total_time": 226.7174837589264, "total_time_str": "3m 47s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1693757250.5896306}
{"results": {"fid50k": 85.12321071542893}, "metric": "fid50k", "total_time": 210.83970594406128, "total_time_str": "3m 31s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000204.pkl", "timestamp": 1693764915.7036355}
{"results": {"fid50k": 47.563969561269744}, "metric": "fid50k", "total_time": 233.97653555870056, "total_time_str": "3m 54s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000409.pkl", "timestamp": 1693772572.6469076}
{"results": {"fid50k": 42.0378087379054}, "metric": "fid50k", "total_time": 211.6967008113861, "total_time_str": "3m 32s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000614.pkl", "timestamp": 1693780185.751803}
{"results": {"fid50k": 40.741863425884134}, "metric": "fid50k", "total_time": 211.70320200920105, "total_time_str": "3m 32s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-000819.pkl", "timestamp": 1693787801.4226646}
{"results": {"fid50k": 36.727746342948834}, "metric": "fid50k", "total_time": 211.39422988891602, "total_time_str": "3m 31s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-001024.pkl", "timestamp": 1693795420.8913603}
{"results": {"fid50k": 35.36935289811818}, "metric": "fid50k", "total_time": 211.84103798866272, "total_time_str": "3m 32s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-001228.pkl", "timestamp": 1693803039.0559103}
{"results": {"fid50k": 34.56291491733728}, "metric": "fid50k", "total_time": 213.0910336971283, "total_time_str": "3m 33s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-001433.pkl", "timestamp": 1693810644.6885796}
{"results": {"fid50k": 231.98312384110938}, "metric": "fid50k", "total_time": 228.0975947380066, "total_time_str": "3m 48s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-001638.pkl", "timestamp": 1693818536.069743}
{"results": {"fid50k": 218.79605704254513}, "metric": "fid50k", "total_time": 220.4409465789795, "total_time_str": "3m 40s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-001843.pkl", "timestamp": 1693826533.7842581}
{"results": {"fid50k": 183.94286614489346}, "metric": "fid50k", "total_time": 233.28316688537598, "total_time_str": "3m 53s", "num_gpus": 8, "snapshot_pkl": "network-snapshot-002048.pkl", "timestamp": 1693834372.6499712}
The FID scores are unstable, and the model tends to collapse after longer training.

I have also evaluated the checkpoint provided at https://drive.google.com/drive/folders/1oJ-FmyVYjIwBZKDAQ4N1EEcE9dJjumdW:
CUDA_VISIBLE_DEVICES=0 python train_3d.py --outdir=save_inference_results/shapenet_chair --gpus=1 --batch=32 --gamma=400 --data_camera_mode shapenet_chair --dmtet_scale 0.8 --use_shapenet_split 1 --one_3d_generator 1 --fp32 0 --inference_vis 1 --resume_pretrain weights/shapenet_chair.pt --inference_compute_fid 1 --data='/home/XXX/projects/XXX/Datasets/GET3D/ShapeNet/img/03001627' --camera_path /home/XXX/projects/XXX/Datasets/GET3D/ShapeNet/camera/

{"results": {"fid50k": 22.706035931177578}, "metric": "fid50k", "total_time": 1566.5149657726288, "total_time_str": "26m 07s", "num_gpus": 1, "snapshot_pkl": "weights/shapenet_chair.pt", "timestamp": 1693843925.6001537}

The best model I achieved is network-snapshot-001433.pkl, which gets:

{"results": {"fid50k": 28.708304000589685}, "metric": "fid50k", "total_time": 1631.6883997917175, "total_time_str": "27m 12s", "num_gpus": 1, "snapshot_pkl": "../../../results/00001-stylegan2-03001627-gpus8-batch32-gamma400/network-snapshot-001433.pt", "timestamp": 1693845837.4513173}

Is there any problem with my training settings?
