different gpus to train #3736
Comments
@alicera for Multi-GPU training it's recommended to use even GPU counts (2, 4, 8) and GPUs that are all the exact same model.
But a TITAN X and a 1080 Ti should be usable at the same time.
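A minimal sketch of the check this recommendation implies, assuming only that PyTorch is installed (the CUDA:i output format simply mirrors the environment listing further down in this issue; nothing here is an Ultralytics API): list the visible CUDA devices and warn when the models are mixed.

import torch

# Enumerate every visible CUDA device and flag mixed models before multi-GPU training.
names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
for i, name in enumerate(names):
    mem_mb = torch.cuda.get_device_properties(i).total_memory / 1024 ** 2
    print(f"CUDA:{i} ({name}, {mem_mb:.1f}MB)")
if len(set(names)) > 1:
    print("Warning: mixed GPU models detected; DDP training is recommended on identical GPUs only.")

On the machine reported below this would print three GeForce GTX 1080 Ti entries plus one GeForce GTX TITAN X and emit the warning.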
@alicera well, for starters, Ultralytics will never be able to reproduce this error on this hardware combination, so there's no action for us to take. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.
How to create a Minimal, Reproducible Example
When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be minimal, complete, and reproducible.
In addition to the above requirements, for Ultralytics to provide assistance your code should be current (up to date with the latest GitHub master) and unmodified from this repository's codebase.
If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template, providing a minimum reproducible example to help us better understand and diagnose your problem. Thank you! 😃
Thank you! 😃
docker: pytorch-21.03
Driver Version: 460.73.01
GPU:
CUDA:0 (GeForce GTX 1080 Ti, 11178.5MB)
CUDA:1 (GeForce GTX 1080 Ti, 11178.5MB)
CUDA:2 (GeForce GTX 1080 Ti, 11178.5MB)
CUDA:3 (GeForce GTX TITAN X, 12212.8125MB)
Command: python -m torch.distributed.launch --nproc_per_node 4 train.py --resume
Traceback (most recent call last):
File "train.py", line 541, in
train(hyp, opt, device, tb_writer)
File "train.py", line 304, in train
loss, loss_items = compute_loss(pred, targets.to(device)) # loss scaled by batch_size
RuntimeError: CUDA error: the launch timed out and was terminated
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fd165a3e5cc in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fd165a04d4e in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x987 (0x7fd165a7f6f7 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x5c (0x7fd165a244cc in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x29a (0x7fd1b2b3bd7a in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x1c4 (0x7fd1b2b31444 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x16 (0x7fd1b2b642c6 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7fd1b25ebf58 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_ptr<c10d::Logger*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x22 (0x7fd1b2b697f2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7fd1b25ebf58 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0xc700e5 (0x7fd1b2b680e5 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x6ff782 (0x7fd1b25f7782 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x700743 (0x7fd1b25f8743 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0x12b785 (0x5565291cb785 in /opt/conda/bin/python)
frame #14: <unknown function> + 0x1ca984 (0x55652926a984 in /opt/conda/bin/python)
frame #15: <unknown function> + 0x11f906 (0x5565291bf906 in /opt/conda/bin/python)
frame #16: <unknown function> + 0x12bc96 (0x5565291cbc96 in /opt/conda/bin/python)
frame #17: <unknown function> + 0x12bc4c (0x5565291cbc4c in /opt/conda/bin/python)
frame #18: <unknown function> + 0x154ec8 (0x5565291f4ec8 in /opt/conda/bin/python)
frame #19: PyDict_SetItemString + 0x87 (0x5565291f6127 in /opt/conda/bin/python)
frame #20: PyImport_Cleanup + 0x9a (0x5565292f65aa in /opt/conda/bin/python)
frame #21: Py_FinalizeEx + 0x7d (0x5565292f694d in /opt/conda/bin/python)
frame #22: Py_RunMain + 0x110 (0x5565292f77f0 in /opt/conda/bin/python)
frame #23: Py_BytesMain + 0x39 (0x5565292f7979 in /opt/conda/bin/python)
frame #24: __libc_start_main + 0xf3 (0x7fd1e12bf0b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #25: <unknown function> + 0x1e7185 (0x556529287185 in /opt/conda/bin/python)
0%| | 0/1195 [00:00<?, ?it/s]
Killing subprocess 20868
Killing subprocess 20869
Killing subprocess 20870
Killing subprocess 20871
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'train.py', '--local_rank=3', '--resume', 'runs/train/exp/weights/last.pt']' died with <Signals.SIGABRT: 6>.
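Given the command and launch-timeout error above on a mixed 1080 Ti / TITAN X node, here is a minimal sketch of the workaround suggested earlier in the thread, assuming the three GeForce GTX 1080 Ti cards are CUDA devices 0-2 as listed and that train.py is the script from this issue; the choice of devices 0 and 1 is illustrative, not a confirmed fix for this exact error.

import os
import subprocess
import sys

# Expose only two identical 1080 Ti cards (devices 0 and 1 in the listing above) and
# match --nproc_per_node to that count, per the even-count, identical-GPU recommendation.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0,1")
cmd = [sys.executable, "-m", "torch.distributed.launch",
       "--nproc_per_node", "2", "train.py", "--resume"]
subprocess.run(cmd, env=env, check=True)

Setting CUDA_VISIBLE_DEVICES in the child environment keeps the TITAN X invisible to every DDP worker, so --nproc_per_node only needs to match the number of exposed devices.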