
DeiT Training Errors out after 8 epochs #102

Closed

molocule opened this issue Jul 14, 2021 · 2 comments

Comments

@molocule

molocule commented Jul 14, 2021

Hey DeiT team! Thanks so much for open-sourcing the codebase for the DeiT paper! I was trying to reproduce results for the deit_base_patch16_224 model to see training curves and play around with hyperparameters, but I noticed that my job failed once it reached Epoch 8. I tried re-running 3 times, but it always died with the same error at the same point (Epoch 8, ~650/1251 batches).

To run, I followed the steps from the repository and ran the following command:
python run_with_submitit.py --model deit_base_patch16_224 --use_volta32

I thought it could be a memory issue, but I see that the log prints:
use_volta32=True

I receive this error:

srun: error: learnfair0824: task 5: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=43575862.0
slurmstepd: error: *** STEP 43575862.0 ON learnfair0824 CANCELLED AT 2021-07-14T08:00:36 ***
srun: error: learnfair0824: tasks 3,7: Killed
srun: error: learnfair0833: task 8: Killed
srun: error: learnfair0824: task 1: Killed
srun: error: learnfair0824: tasks 0,2,6: Killed
srun: error: learnfair0833: tasks 9-10,12-15: Killed
srun: error: learnfair0824: task 4: Killed
srun: error: learnfair0833: task 11: Killed
srun: Force Terminated StepId=43575862.0

Any help would be appreciated!! Thank you!

@TouvronHugo
Contributor

Hi @molocule ,
Thanks for your question. Do you have more details about the error in the logs of the other GPUs?
If the error is related to the presence of a NaN, the information in this issue may be useful.
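For context, when the loss becomes NaN the training loop normally stops itself with an explicit message rather than being killed by SLURM, so it should be easy to spot in the logs. Here is a minimal sketch of that kind of guard, as used in DeiT-style training loops (the exact code in this repo's engine.py may differ slightly):

```python
# Sketch of the NaN guard typically found in DeiT-style training loops
# (the exact code in engine.py may differ slightly).
import math
import sys

import torch

def check_loss(loss: torch.Tensor) -> float:
    """Abort the run with a clear message if the loss is NaN/Inf."""
    loss_value = loss.item()
    if not math.isfinite(loss_value):
        # A NaN/Inf loss shows up as an explicit line in the rank-0 log
        # right before the process exits.
        print(f"Loss is {loss_value}, stopping training")
        sys.exit(1)
    return loss_value

# Example: a healthy loss passes through unchanged.
print(check_loss(torch.tensor(6.58)))
```

If none of the rank logs contain a line like that, the crash is probably not NaN-related.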
Best,
Hugo

@molocule
Author

Hi @TouvronHugo !

Hope you are doing well. Thank you for the response :))

No, only the rank_0 log has any details on the error (that is the extent of the details in that log; everything else is a warning).

The last few lines of the training log are:

Epoch: [8] [ 630/1251] eta: 0:03:19 lr: 0.000999 loss: 6.6433 (6.5852) time: 0.3171 data: 0.0002 max mem: 10092
Epoch: [8] [ 640/1251] eta: 0:03:16 lr: 0.000999 loss: 6.6181 (6.5835) time: 0.3181 data: 0.0002 max mem: 10092
Epoch: [8] [ 650/1251] eta: 0:03:12 lr: 0.000999 loss: 6.4880 (6.5829) time: 0.3182 data: 0.0002 max mem: 10092

So it does not appear to be a NaN issue. Please let me know if I am missing something!
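In case it helps with debugging, here is a rough sketch of how the other ranks' logs could be scanned for error lines in one pass (the directory and file naming are assumptions based on submitit writing one stdout/stderr file per task; adjust the glob and path to your setup):

```python
# Rough sketch: grep all per-rank submitit logs for common failure signatures.
# The "*_log.err" pattern and the log directory are assumptions about the
# submitit output layout; adjust them to match your shared folder.
from pathlib import Path

PATTERNS = ("Traceback", "CUDA out of memory", "RuntimeError", "nan")

def scan_logs(log_dir: str, glob: str = "*_log.err") -> None:
    for path in sorted(Path(log_dir).glob(glob)):
        hits = [line.rstrip()
                for line in path.read_text(errors="ignore").splitlines()
                if any(p in line for p in PATTERNS)]
        if hits:
            print(f"--- {path.name} ---")
            print("\n".join(hits[:20]))  # only the first few matches per rank

if __name__ == "__main__":
    scan_logs("path/to/shared_folder/job_dir")  # placeholder path
```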
