[Train] Don't use NCCL_BLOCKING_WAIT
#29562
Conversation
Signed-off-by: Amog Kamsetty <[email protected]>
)
os.environ["NCCL_BLOCKING_WAIT"] = "1"
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
Can we update the test plan for the failure behavior? IIUC, the documentation says NCCL_ASYNC_ERROR_HANDLING is more performant but crashes the process, whereas NCCL_BLOCKING_WAIT surfaces errors to the user that can be caught and handled. This has implications for Ray Train's error handling semantics.
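For reference, a minimal standalone sketch of the difference (plain PyTorch, not the Ray Train code path; the 5-second timeout and env:// setup are assumptions for illustration). With NCCL_BLOCKING_WAIT=1 a timed-out collective surfaces as a catchable RuntimeError, while with NCCL_ASYNC_ERROR_HANDLING=1 the NCCL watchdog aborts the process and there is nothing to catch:

```python
import datetime
import os

import torch
import torch.distributed as dist

# Illustrative only: blocking wait turns a timed-out collective into an
# exception the training script can catch. Async error handling instead lets
# the NCCL watchdog tear the whole process down on timeout.
os.environ["NCCL_BLOCKING_WAIT"] = "1"

# Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are already set (env:// init).
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=5))

tensor = torch.ones(1, device="cuda")
work = dist.all_reduce(tensor, async_op=True)
try:
    # Hangs until the timeout if a peer never joins the collective.
    work.wait()
except RuntimeError as exc:
    # Only reachable under NCCL_BLOCKING_WAIT; with NCCL_ASYNC_ERROR_HANDLING
    # the process is killed before user code sees an exception.
    print(f"Caught NCCL timeout: {exc}")
```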
+1, we should trigger this code path and make sure the crash output provides enough information to the user before merging. I don't think we can do much better than crashing unfortunately.
Agreed we should do it. Any suggestions on how to trigger this code path? Couldn't think of an easy way.
Launch data-parallel training (at least two actors) that uses NCCL to do the allreduce. Make one of the actors enter a while True: sleep loop so that it never enters the allreduce. Then, after 30 minutes, you'll see PyTorch crash the process. It will be even easier if you reduce the timeout ;)
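For concreteness, a rough sketch of such a repro using the Ray AIR TorchTrainer API; the TorchConfig(timeout_s=...) knob and exact imports are assumptions here, not the PR's actual test:

```python
import time

import torch
import torch.distributed as dist

from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer


def train_loop_per_worker():
    if session.get_world_rank() == 0:
        # Rank 0 never reaches the allreduce, so the other worker's
        # collective hits the NCCL timeout.
        while True:
            time.sleep(1)
    tensor = torch.ones(1, device="cuda")
    dist.all_reduce(tensor)


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    # Shrink the process group timeout so the repro fails in seconds
    # instead of the default 30 minutes.
    torch_config=TorchConfig(timeout_s=5),
)
trainer.fit()
```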
Yep, looks like an exception is being raised
(RayTrainWorker pid=13803) [E ProcessGroupNCCL.cpp:737] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, Timeout(ms)=5000) ran for 7751 milliseconds before timing out.
(RayTrainWorker pid=13803) [E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
(RayTrainWorker pid=13803) [2022-10-21 16:23:36,638 E 13803 13875] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, Timeout(ms)=5000) ran for 7751 milliseconds before timing out.
(RayTrainWorker pid=13803) [2022-10-21 16:23:36,648 E 13803 13875] logging.cc:104: Stack trace:
(RayTrainWorker pid=13803) /home/ray/anaconda3/lib/python3.8/site-packages/ray/_raylet.so(+0xc74dda) [0x7f0934867dda] ray::operator<<()
(RayTrainWorker pid=13803) /home/ray/anaconda3/lib/python3.8/site-packages/ray/_raylet.so(+0xc77598) [0x7f093486a598] ray::TerminateHandler()
(RayTrainWorker pid=13803) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xacf6f) [0x7f0933b2af6f] __cxxabiv1::__terminate()
(RayTrainWorker pid=13803) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xacfb1) [0x7f0933b2afb1] __cxxabiv1::__unexpected()
(RayTrainWorker pid=13803) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xacf6c) [0x7f0933b2af6c] __cxxabiv1::__terminate()
(RayTrainWorker pid=13803) /home/ray/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so(_ZN4c10d16ProcessGroupNCCL8WorkNCCL15handleNCCLGuardEv+0x19f) [0x7efbc5ae2d4f] c10d::ProcessGroupNCCL::WorkNCCL::handleNCCLGuard()
(RayTrainWorker pid=13803) /home/ray/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so(_ZN4c10d16ProcessGroupNCCL15workCleanupLoopEv+0x199) [0x7efbc5ae71c9] c10d::ProcessGroupNCCL::workCleanupLoop()
(RayTrainWorker pid=13803) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xc9039) [0x7f0933b47039] execute_native_thread_routine
(RayTrainWorker pid=13803) /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f09354e5609] start_thread
(RayTrainWorker pid=13803) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f093540a133] __clone
(RayTrainWorker pid=13803)
But the Ray Actor is still alive, causing training to hang. @rkooo567 do you know why the actor is not terminating when receiving this exception?
Is the Ray actor still alive? I think the process that contains the Ray actor should be killed by SIGABRT:
https://github.com/ray-project/ray/blob/master/src/ray/util/logging.cc#L106
Yes, the actor is still alive. Not sure why the std::abort() is not being captured.
Note that the std::abort() is not being run in the main thread, but from what I understand, it should still kill the entire process.
Added test
Signed-off-by: Amog Kamsetty <[email protected]>
Blocked on #29576
Signed-off-by: amogkam <[email protected]>
python/ray/train/tests/test_gpu.py
Outdated
# NCCL should timeout.
if session.get_world_rank() == 0:
    while True:
        pass
nit: can we have a time.sleep here? That way we don't consume an entire CPU unnecessarily 🔥
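Something like this, hypothetically adapting the snippet above (assumes `session` is already imported as in the test file):

```python
import time

# NCCL should time out; sleeping keeps the stalled worker from spinning a CPU core.
if session.get_world_rank() == 0:
    while True:
        time.sleep(1)
```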
Signed-off-by: amogkam <[email protected]>
Hmm, wait, what's the status of #29576? Don't we need it first before changing our recovery?
Signed-off-by: amogkam <[email protected]>
From the pytorch docs, we should use NCCL_ASYNC_ERROR_HANDLING instead. Signed-off-by: amogkam <[email protected]> Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Amog Kamsetty [email protected]

Why are these changes needed?
From the PyTorch docs, we should use NCCL_ASYNC_ERROR_HANDLING instead of NCCL_BLOCKING_WAIT (see the sketch below the checklist).

Related issue number
Closes #29419

Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
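A minimal sketch of the env-var change this PR makes, assuming the variables are set right before torch.distributed.init_process_group() in the Train worker setup; the helper name is hypothetical, not the actual Ray Train code:

```python
import os


def _configure_nccl_error_handling() -> None:
    # Hypothetical helper mirroring this PR's change.
    # Old behavior (removed): blocking wait surfaced timed-out collectives as
    # catchable errors, which the thread above notes is less performant.
    os.environ.pop("NCCL_BLOCKING_WAIT", None)
    # New behavior: the NCCL watchdog aborts the process on a timed-out
    # collective, the option the PyTorch docs describe as more performant.
    os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
```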