[Train] Don't use NCCL_BLOCKING_WAIT
#29562
Conversation
Signed-off-by: Amog Kamsetty <[email protected]>
)
os.environ["NCCL_BLOCKING_WAIT"] = "1"
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
Can we update the test plan for the failure behavior? IIUC, the documentation says NCCL_ASYNC_ERROR_HANDLING is more performant but crashes the process, whereas NCCL_BLOCKING_WAIT surfaces errors to the user that can be caught and handled. This has implications for Ray Train's error handling semantics.
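For reference, a minimal standalone sketch of the difference (plain PyTorch, not the Ray Train code path; the 5-second timeout and env:// setup are assumptions for illustration). With NCCL_BLOCKING_WAIT=1 a timed-out collective surfaces as a catchable RuntimeError, while with NCCL_ASYNC_ERROR_HANDLING=1 the NCCL watchdog aborts the process and there is nothing to catch:

```python
import datetime
import os

import torch
import torch.distributed as dist

# Illustrative only: blocking wait turns a timed-out collective into an
# exception the training script can catch. Async error handling instead lets
# the NCCL watchdog tear the whole process down on timeout.
os.environ["NCCL_BLOCKING_WAIT"] = "1"

# Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are already set (env:// init).
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=5))

tensor = torch.ones(1, device="cuda")
work = dist.all_reduce(tensor, async_op=True)
try:
    # Hangs until the timeout if a peer never joins the collective.
    work.wait()
except RuntimeError as exc:
    # Only reachable under NCCL_BLOCKING_WAIT; with NCCL_ASYNC_ERROR_HANDLING
    # the process is killed before user code sees an exception.
    print(f"Caught NCCL timeout: {exc}")
```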
+1, we should trigger this code path and make sure the crash output provides enough information to the user before merging. I don't think we can do much better than crashing unfortunately.
Agreed we should do it. Any suggestions on how to trigger this code path? Couldn't think of an easy way.
Launch data-parallel training (at least two actors) that uses NCCL to do the allreduce. Make one of the actors enter a while True: sleep loop so that it never enters the allreduce. Then, after 30 minutes, you'll see PyTorch crash the process. It will be even easier if you reduce the timeout ;)
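For concreteness, a rough sketch of such a repro using the Ray AIR TorchTrainer API; the TorchConfig(timeout_s=...) knob and exact imports are assumptions here, not the PR's actual test:

```python
import time

import torch
import torch.distributed as dist

from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer


def train_loop_per_worker():
    if session.get_world_rank() == 0:
        # Rank 0 never reaches the allreduce, so the other worker's
        # collective hits the NCCL timeout.
        while True:
            time.sleep(1)
    tensor = torch.ones(1, device="cuda")
    dist.all_reduce(tensor)


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    # Shrink the process group timeout so the repro fails in seconds
    # instead of the default 30 minutes.
    torch_config=TorchConfig(timeout_s=5),
)
trainer.fit()
```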
Yep, looks like an exception is being raised
(RayTrainWorker pid=13803) [E ProcessGroupNCCL.cpp:737] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, Timeout(ms)=5000) ran for 7751 milliseconds before timing out.
(RayTrainWorker pid=13803) [E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
(RayTrainWorker pid=13803) [2022-10-21 16:23:36,638 E 13803 13875] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, Timeout(ms)=5000) ran for 7751 milliseconds before timing out.
(RayTrainWorker pid=13803) [2022-10-21 16:23:36,648 E 13803 13875] logging.cc:104: Stack trace:
(RayTrainWorker pid=13803) /home/ray/anaconda3/lib/python3.8/site-packages/ray/_raylet.so(+0xc74dda) [0x7f0934867dda] ray::operator<<()
(RayTrainWorker pid=13803) /home/ray/anaconda3/lib/python3.8/site-packages/ray/_raylet.so(+0xc77598) [0x7f093486a598] ray::TerminateHandler()
(RayTrainWorker pid=13803) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xacf6f) [0x7f0933b2af6f] __cxxabiv1::__terminate()
(RayTrainWorker pid=13803) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xacfb1) [0x7f0933b2afb1] __cxxabiv1::__unexpected()
(RayTrainWorker pid=13803) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xacf6c) [0x7f0933b2af6c] __cxxabiv1::__terminate()
(RayTrainWorker pid=13803) /home/ray/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so(_ZN4c10d16ProcessGroupNCCL8WorkNCCL15handleNCCLGuardEv+0x19f) [0x7efbc5ae2d4f] c10d::ProcessGroupNCCL::WorkNCCL::handleNCCLGuard()
(RayTrainWorker pid=13803) /home/ray/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so(_ZN4c10d16ProcessGroupNCCL15workCleanupLoopEv+0x199) [0x7efbc5ae71c9] c10d::ProcessGroupNCCL::workCleanupLoop()
(RayTrainWorker pid=13803) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xc9039) [0x7f0933b47039] execute_native_thread_routine
(RayTrainWorker pid=13803) /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f09354e5609] start_thread
(RayTrainWorker pid=13803) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f093540a133] __clone
(RayTrainWorker pid=13803)
But the Ray Actor is still alive, causing training to hang. @rkooo567 do you know why the actor is not terminating when receiving this exception?
Is the Ray actor still alive? I think the process that contains the Ray actor should be killed by SIGABRT:
https://github.com/ray-project/ray/blob/master/src/ray/util/logging.cc#L106
Yes, the actor is still alive. Not sure why the std::abort() is not being captured.
Note that the std::abort() is not being run in the main thread, but from what I understand, it should still kill the entire process.
Added test
Signed-off-by: Amog Kamsetty <[email protected]>
Blocked on #29576
Signed-off-by: amogkam <[email protected]>
python/ray/train/tests/test_gpu.py
Outdated
# NCCL should timeout.
if session.get_world_rank() == 0:
    while True:
        pass
nit: can we have a time.sleep here? That way we don't consume an entire CPU unnecessarily 🔥
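Something like this, hypothetically adapting the snippet above (assumes `session` is already imported as in the test file):

```python
import time

# NCCL should time out; sleeping keeps the stalled worker from spinning a CPU core.
if session.get_world_rank() == 0:
    while True:
        time.sleep(1)
```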
Signed-off-by: amogkam <[email protected]>
Hmm, wait, what's the status of #29576? Don't we need it first before changing our recovery?
Signed-off-by: amogkam <[email protected]>
From the pytorch docs, we should use NCCL_ASYNC_ERROR_HANDLING instead. Signed-off-by: amogkam <[email protected]> Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Amog Kamsetty [email protected]

Why are these changes needed?
From the PyTorch docs, we should use NCCL_ASYNC_ERROR_HANDLING instead of NCCL_BLOCKING_WAIT (see the sketch below the checklist).

Related issue number
Closes #29419

Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
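A minimal sketch of the env-var change this PR makes, assuming the variables are set right before torch.distributed.init_process_group() in the Train worker setup; the helper name is hypothetical, not the actual Ray Train code:

```python
import os


def _configure_nccl_error_handling() -> None:
    # Hypothetical helper mirroring this PR's change.
    # Old behavior (removed): blocking wait surfaced timed-out collectives as
    # catchable errors, which the thread above notes is less performant.
    os.environ.pop("NCCL_BLOCKING_WAIT", None)
    # New behavior: the NCCL watchdog aborts the process on a timed-out
    # collective, the option the PyTorch docs describe as more performant.
    os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
```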