
DeiT Training Errors out after 8 epochs #102

Closed

molocule opened this issue Jul 14, 2021 · 2 comments

Comments

@molocule

molocule commented Jul 14, 2021

Hey DeiT team! Thanks so much for open-sourcing the codebase for the DeiT paper! I was trying to reproduce results for the deit_base_patch16_224 model to see training curves and play around with hyperparameters, but I noticed that my job failed once it reached Epoch 8. I tried re-running 3 times, but it always died with the same error at the same point (Epoch 8, ~650/1251 batches).

To run, I followed the steps from the repository and ran the following command:
python run_with_submitit.py --model deit_base_patch16_224 --use_volta32

I thought it could be a memory issue, but I see that the log prints:
use_volta32=True

I receive this error:

srun: error: learnfair0824: task 5: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=43575862.0
slurmstepd: error: *** STEP 43575862.0 ON learnfair0824 CANCELLED AT 2021-07-14T08:00:36 ***
srun: error: learnfair0824: tasks 3,7: Killed
srun: error: learnfair0833: task 8: Killed
srun: error: learnfair0824: task 1: Killed
srun: error: learnfair0824: tasks 0,2,6: Killed
srun: error: learnfair0833: tasks 9-10,12-15: Killed
srun: error: learnfair0824: task 4: Killed
srun: error: learnfair0833: task 11: Killed
srun: Force Terminated StepId=43575862.0

Any help would be appreciated!! Thank you!

@TouvronHugo
Contributor

Hi @molocule ,
Thanks for your question. Do you have more details about the error in the logs of the other GPUs?
If the error is related to the presence of a NaN, the information in this issue may be useful.
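For context, when the loss becomes NaN the training loop normally stops itself with an explicit message rather than being killed by SLURM, so it should be easy to spot in the logs. Here is a minimal sketch of that kind of guard, as used in DeiT-style training loops (the exact code in this repo's engine.py may differ slightly):

```python
# Sketch of the NaN guard typically found in DeiT-style training loops
# (the exact code in engine.py may differ slightly).
import math
import sys

import torch

def check_loss(loss: torch.Tensor) -> float:
    """Abort the run with a clear message if the loss is NaN/Inf."""
    loss_value = loss.item()
    if not math.isfinite(loss_value):
        # A NaN/Inf loss shows up as an explicit line in the rank-0 log
        # right before the process exits.
        print(f"Loss is {loss_value}, stopping training")
        sys.exit(1)
    return loss_value

# Example: a healthy loss passes through unchanged.
print(check_loss(torch.tensor(6.58)))
```

If none of the rank logs contain a line like that, the crash is probably not NaN-related.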
Best,
Hugo

@molocule
Author

Hi @TouvronHugo !

Hope you are doing well. Thank you for the response :))

No, only the rank_0 log has any details on the error (that is the extent of the details in that log; everything else is a warning).

The last few lines of the training log are:

Epoch: [8] [ 630/1251] eta: 0:03:19 lr: 0.000999 loss: 6.6433 (6.5852) time: 0.3171 data: 0.0002 max mem: 10092
Epoch: [8] [ 640/1251] eta: 0:03:16 lr: 0.000999 loss: 6.6181 (6.5835) time: 0.3181 data: 0.0002 max mem: 10092
Epoch: [8] [ 650/1251] eta: 0:03:12 lr: 0.000999 loss: 6.4880 (6.5829) time: 0.3182 data: 0.0002 max mem: 10092

So it does not appear to be a NaN issue. Please let me know if I am missing something!
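In case it helps with debugging, here is a rough sketch of how the other ranks' logs could be scanned for error lines in one pass (the directory and file naming are assumptions based on submitit writing one stdout/stderr file per task; adjust the glob and path to your setup):

```python
# Rough sketch: grep all per-rank submitit logs for common failure signatures.
# The "*_log.err" pattern and the log directory are assumptions about the
# submitit output layout; adjust them to match your shared folder.
from pathlib import Path

PATTERNS = ("Traceback", "CUDA out of memory", "RuntimeError", "nan")

def scan_logs(log_dir: str, glob: str = "*_log.err") -> None:
    for path in sorted(Path(log_dir).glob(glob)):
        hits = [line.rstrip()
                for line in path.read_text(errors="ignore").splitlines()
                if any(p in line for p in PATTERNS)]
        if hits:
            print(f"--- {path.name} ---")
            print("\n".join(hits[:20]))  # only the first few matches per rank

if __name__ == "__main__":
    scan_logs("path/to/shared_folder/job_dir")  # placeholder path
```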
