
RuntimeError: CUDA error: an illegal memory access was encountered #694

Closed
CacacaLalala opened this issue Sep 6, 2024 · 3 comments

@CacacaLalala

Hello, and thank you very much for open-sourcing such a great project. When I run multi-node training with this code, I frequently hit RuntimeError: CUDA error: an illegal memory access was encountered, and it occurs quite randomly. Is this error caused by running out of memory, or by something else?
The full error output is below; export CUDA_LAUNCH_BLOCKING=1 was already enabled.

Epoch 0: 35%|____ | 759/2142 [53:27<19:18:44, 50.27s/it, loss=0.457, step=749, global_step=749]Traceback (most recent call last):
File "/Open-Sora-v1.2-2.88B/scripts/train.py", line 551, in
main()
File "/Open-Sora-v1.2-2.88B/scripts/train.py", line 371, in main
booster.backward(loss=loss, optimizer=optimizer)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/booster/booster.py", line 176, in backward
optimizer.backward(loss)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 536, in backward
loss.backward(retain_graph=retain_graph)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/_tensor.py", line 522, in backward
torch.autograd.backward(
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/autograd/init.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 272, in grad_handler
LowLevelZeroOptimizer.add_to_bucket(param, group_id, bucket_store, param_store, grad_store)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 519, in add_to_bucket
LowLevelZeroOptimizer.run_reduction(bucket_store, grad_store)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 297, in run_reduction
bucket_store.build_grad_in_bucket()
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/bookkeeping/bucket_store.py", line 106, in build_grad_in_bucket
grad_current_rank = grad_list[rank].clone().detach()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank63]:[E ProcessGroupNCCL.cpp:1182] [Rank 63] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2f83f84d87 in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f2f83f3575f in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f2f840558a8 in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7f2f851283ac in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f2f8512c4c8 in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7f2f8512fbfa in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2f85130839 in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd3b55 (0x7f2fcee5ab55 in /root/miniconda3/envs/opensora/bin/../lib/libstdc++.so.6)
frame #8: + 0x8609 (0x7f2fcffcb609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f2fcfd96133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 63] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
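
For reference, the debug setting was enabled roughly as in the minimal sketch below (checked_backward is a hypothetical helper shown only for illustration; it is not part of the Open-Sora or ColossalAI code):

import os

# CUDA_LAUNCH_BLOCKING must be set before the first CUDA call creates a
# context, so it is exported before importing torch / starting training.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # report CUDA errors at the failing kernel launch

import torch


def checked_backward(loss: torch.Tensor) -> None:
    # Force a device sync after backward so an asynchronous CUDA error
    # (such as this illegal memory access) surfaces at this call site
    # instead of at a later, unrelated API call.
    loss.backward()
    torch.cuda.synchronize()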

Looking forward to your reply!


This issue is stale because it has been open for 7 days with no activity.

github-actions bot added the stale label Sep 14, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned Sep 21, 2024
@JonathanLi19

I've run into the same problem. Did you manage to solve it?
