Hello, thank you very much for open-sourcing such a great project. When I use the code for multi-node training, I frequently run into RuntimeError: CUDA error: an illegal memory access was encountered, and it happens quite randomly. Is this error caused by running out of memory, or by something else?
The detailed error output is below; I have already set export CUDA_LAUNCH_BLOCKING=1.
Epoch 0: 35%|____ | 759/2142 [53:27<19:18:44, 50.27s/it, loss=0.457, step=749, global_step=749]Traceback (most recent call last):
File "/Open-Sora-v1.2-2.88B/scripts/train.py", line 551, in
main()
File "/Open-Sora-v1.2-2.88B/scripts/train.py", line 371, in main
booster.backward(loss=loss, optimizer=optimizer)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/booster/booster.py", line 176, in backward
optimizer.backward(loss)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 536, in backward
loss.backward(retain_graph=retain_graph)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/_tensor.py", line 522, in backward
torch.autograd.backward(
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/autograd/init.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 272, in grad_handler
LowLevelZeroOptimizer.add_to_bucket(param, group_id, bucket_store, param_store, grad_store)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 519, in add_to_bucket
LowLevelZeroOptimizer.run_reduction(bucket_store, grad_store)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 297, in run_reduction
bucket_store.build_grad_in_bucket()
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/bookkeeping/bucket_store.py", line 106, in build_grad_in_bucket
grad_current_rank = grad_list[rank].clone().detach()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[rank63]:[E ProcessGroupNCCL.cpp:1182] [Rank 63] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2f83f84d87 in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f2f83f3575f in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f2f840558a8 in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7f2f851283ac in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f2f8512c4c8 in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7f2f8512fbfa in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2f85130839 in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd3b55 (0x7f2fcee5ab55 in /root/miniconda3/envs/opensora/bin/../lib/libstdc++.so.6)
frame #8: + 0x8609 (0x7f2fcffcb609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f2fcfd96133 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 63] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
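For reference, here is a minimal sketch of how the debug flags mentioned in the error message were set before launching the job (the actual multi-node launch command is omitted); note that TORCH_USE_CUDA_DSA is a build-time option and only takes effect with a PyTorch build compiled with it:

# Set in the launch environment on every node before starting training.
export CUDA_LAUNCH_BLOCKING=1      # make kernel launches synchronous so the Python stack trace points at the failing op
# export TORCH_USE_CUDA_DSA=1      # device-side assertions; requires a PyTorch build compiled with this flag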
Looking forward to your reply!