Killed #114
Comments
This may also be caused by the memory leak bug #103. We can reproduce this bug, and a fix is on the way.
I am getting this pretty consistently, too. I think it could be related to the memory leak because if I run on a single GPU, I get an "out of memory" error, but if I run on multiple GPUs, I get something similar to this. More specifically:
(and I already tried setting the eval_loader's num_workers to 0)
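For reference, a minimal sketch of the num_workers=0 workaround mentioned above; the loader construction here is illustrative, not YOLOX's actual evaluation code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in for the evaluation dataset (not YOLOX's real one).
eval_dataset = TensorDataset(torch.randn(64, 3, 416, 416))

# num_workers=0 keeps all data loading in the main process, so no worker
# processes are forked and the copy-on-write memory growth cannot occur,
# at the cost of slower data loading.
eval_loader = DataLoader(eval_dataset, batch_size=8, num_workers=0)

for (batch,) in eval_loader:
    pass  # run evaluation on each batch here
```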
Yes, I think we may be triggering the memory leak bug of the PyTorch dataloader, as in pytorch/pytorch#13246. I am trying some of the solutions listed in that issue, but so far some of them have not worked.
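One of the workarounds commonly cited in pytorch/pytorch#13246 is to keep per-sample metadata in a numpy array rather than a Python list, since forked workers reading Python objects update reference counts and trigger copy-on-write of the parent process's memory pages. A minimal sketch of that idea (this is not YOLOX's actual dataset code):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class ArrayBackedDataset(Dataset):
    """Keeps annotations in one numpy buffer instead of a Python list.

    Reading a Python list from forked workers updates per-object refcounts,
    which copies the parent's memory pages and looks like a leak that grows
    every epoch; a single numpy array avoids touching per-element objects.
    """

    def __init__(self, num_samples=10000):
        # Dummy "annotations": one row of 5 floats per sample.
        self.annotations = np.zeros((num_samples, 5), dtype=np.float32)

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        image = torch.randn(3, 416, 416)               # placeholder image
        target = torch.from_numpy(self.annotations[idx].copy())
        return image, target


loader = DataLoader(ArrayBackedDataset(), batch_size=8, num_workers=4)
```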
Ok, thanks for the info. I'll look into it some, too. If I find anything that works, I'll let you know 👍
Hello, when I train nano on a single GPU with python tools/train.py -f exps/default/nano.py -d 1 -b 8 --fp16 -o, the process gets killed (at epoch 7):
2021-07-23 03:52:26 | INFO | yolox.core.trainer:245 - epoch: 7/300, iter: 12700/14786, mem: 20340Mb, iter_time: 0.679s, data_time: 0.493s, total_loss: 8.6, iou_loss: 2.5, l1_loss: 0.0, conf_loss: 3.8, cls_loss: 2.3, lr: 1.250e-03, size: 416, ETA: 11 days, 13:25:24
Killed
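If the process is being killed by the system OOM killer, one mitigation to try while the leak is investigated is lowering the number of dataloader workers. A minimal sketch, assuming the base YOLOX Exp class exposes a data_num_workers attribute that the trainer reads when building its dataloaders:

```python
# Sketch: lower the dataloader worker count in a custom YOLOX exp file.
# Assumes self.data_num_workers is honored by the trainer (assumption).
from yolox.exp import Exp as BaseExp


class Exp(BaseExp):
    def __init__(self):
        super().__init__()
        # Fewer forked worker processes -> less duplicated dataset memory,
        # which can keep the run under the OOM-killer threshold until the
        # underlying leak is fixed.
        self.data_num_workers = 2
```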