
RuntimeError: CUDA error: an illegal memory access was encountered #694

Closed
CacacaLalala opened this issue Sep 6, 2024 · 3 comments

@CacacaLalala

Hello, and thank you very much for open-sourcing such a great project. When I run multi-node training with this code, I frequently hit RuntimeError: CUDA error: an illegal memory access was encountered, and it occurs quite randomly. Is this error caused by running out of memory, or by something else?
The full error output is below; export CUDA_LAUNCH_BLOCKING=1 was already enabled.

Epoch 0: 35%|____ | 759/2142 [53:27<19:18:44, 50.27s/it, loss=0.457, step=749, global_step=749]Traceback (most recent call last):
File "/Open-Sora-v1.2-2.88B/scripts/train.py", line 551, in
main()
File "/Open-Sora-v1.2-2.88B/scripts/train.py", line 371, in main
booster.backward(loss=loss, optimizer=optimizer)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/booster/booster.py", line 176, in backward
optimizer.backward(loss)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 536, in backward
loss.backward(retain_graph=retain_graph)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/_tensor.py", line 522, in backward
torch.autograd.backward(
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/autograd/init.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 272, in grad_handler
LowLevelZeroOptimizer.add_to_bucket(param, group_id, bucket_store, param_store, grad_store)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 519, in add_to_bucket
LowLevelZeroOptimizer.run_reduction(bucket_store, grad_store)
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 297, in run_reduction
bucket_store.build_grad_in_bucket()
File "/root/miniconda3/envs/opensora/lib/python3.9/site-packages/colossalai/zero/low_level/bookkeeping/bucket_store.py", line 106, in build_grad_in_bucket
grad_current_rank = grad_list[rank].clone().detach()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank63]:[E ProcessGroupNCCL.cpp:1182] [Rank 63] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2f83f84d87 in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f2f83f3575f in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f2f840558a8 in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7f2f851283ac in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f2f8512c4c8 in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7f2f8512fbfa in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2f85130839 in /root/miniconda3/envs/opensora/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd3b55 (0x7f2fcee5ab55 in /root/miniconda3/envs/opensora/bin/../lib/libstdc++.so.6)
frame #8: + 0x8609 (0x7f2fcffcb609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f2fcfd96133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 63] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
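
For reference, the debug setting was enabled roughly as in the minimal sketch below (checked_backward is a hypothetical helper shown only for illustration; it is not part of the Open-Sora or ColossalAI code):

import os

# CUDA_LAUNCH_BLOCKING must be set before the first CUDA call creates a
# context, so it is exported before importing torch / starting training.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # report CUDA errors at the failing kernel launch

import torch


def checked_backward(loss: torch.Tensor) -> None:
    # Force a device sync after backward so an asynchronous CUDA error
    # (such as this illegal memory access) surfaces at this call site
    # instead of at a later, unrelated API call.
    loss.backward()
    torch.cuda.synchronize()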

Looking forward to your reply!


This issue is stale because it has been open for 7 days with no activity.

github-actions bot added the stale label Sep 14, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned Sep 21, 2024
@JonathanLi19

I've run into the same problem. Did you manage to solve it?
