[Train] Don't use NCCL_BLOCKING_WAIT
#29562
Merged
Commits (7)
d9773a1  remove (amogkam)
c1cf70b  update (amogkam)
9ea5fbd  Merge branch 'master' of github.com:ray-project/ray into nccl-block (amogkam)
8eccb64  add test (amogkam)
9b2d047  time sleep (amogkam)
d59c0e6  update exceptio type (amogkam)
2e46de2  update (amogkam)
Conversations
Can we update the test plan for failure behavior? IIUC, the documentation says NCCL_ASYNC_ERROR_HANDLING is more performant but crashes the process, while NCCL_BLOCKING_WAIT surfaces errors to the user that can be caught and handled. This has implications for Ray Trainer's error handling semantics.
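For reference, a minimal sketch (not Ray Train's actual setup code) of how the two environment variables are applied before the NCCL process group is created; the timeout value and the assumption that a launcher provides MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are illustrative:

```python
# Sketch only: standard torch.distributed / NCCL environment variables, not
# Ray-specific APIs. Assumes the launcher sets MASTER_ADDR, MASTER_PORT,
# RANK, and WORLD_SIZE for the default env:// rendezvous.
import datetime
import os

import torch.distributed as dist

# Behavior this PR moves to: async error handling. An NCCL failure or
# timeout tears the process down (faster, but not catchable in Python).
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"

# Previous behavior: blocking wait. Collectives block and raise a Python
# exception on timeout that the caller can catch, at a performance cost.
# os.environ["NCCL_BLOCKING_WAIT"] = "1"

# Either variable must be set before the NCCL process group is created.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=30),  # PyTorch's default NCCL timeout
)
```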
+1, we should trigger this code path and make sure the crash output provides enough information to the user before merging. I don't think we can do much better than crashing, unfortunately.
Agreed, we should do it. Any suggestions on how to trigger this code path? I couldn't think of an easy way.
Launch data-parallel training (minimum two actors) that uses NCCL to do the allreduce. Make one of the actors enter a `while True: sleep` loop so that it never reaches the allreduce. Then, after 30 minutes, you'll see PyTorch crash the process. It will be even easier if you reduce the timeout ;)
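A rough repro sketch along these lines (not the test added in this PR); the Worker actor, address, port, and shortened 10-second timeout are illustrative assumptions, and it assumes a single machine with at least 2 GPUs:

```python
# Rough repro sketch, not the test added in this PR. Assumes a single
# machine with at least 2 GPUs; the address, port, and short timeout are
# arbitrary values chosen so the failure shows up in seconds.
import datetime
import os
import time

import ray
import torch
import torch.distributed as dist


@ray.remote(num_gpus=1)
class Worker:
    def __init__(self, rank: int, world_size: int):
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        # With async error handling, a collective that times out aborts the
        # worker process instead of raising a catchable exception.
        os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
        dist.init_process_group(
            backend="nccl",
            rank=rank,
            world_size=world_size,
            timeout=datetime.timedelta(seconds=10),
        )

    def train(self, hang: bool):
        if hang:
            # This worker never reaches the allreduce, so the other
            # worker's collective times out.
            while True:
                time.sleep(1)
        tensor = torch.ones(1).cuda()
        dist.all_reduce(tensor)
        return tensor.item()


ray.init()
workers = [Worker.remote(rank=i, world_size=2) for i in range(2)]
# Worker 0 runs the allreduce; worker 1 just sleeps forever.
ray.get([workers[0].train.remote(False), workers[1].train.remote(True)])
```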
Yep, it looks like an exception is being raised, but the Ray actor is still alive, causing training to hang. @rkooo567 do you know why the actor is not terminating when it receives this exception?
Is the Ray actor still alive? I think the process that contained the Ray actor should be killed by SIGABRT: https://github.com/ray-project/ray/blob/master/src/ray/util/logging.cc#L106
Yes, the actor is still alive. Not sure why the std::abort() is not being captured. Note that the std::abort() is not being run in the main thread, but from what I understand, it should kill the entire process.
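As a sanity check of that claim (outside of Ray), a small POSIX-only sketch showing that an abort() raised from a background thread still terminates the whole process with SIGABRT:

```python
# Not Ray code: a self-contained POSIX check that abort() from a non-main
# thread kills the entire process (the child below should die with SIGABRT
# rather than reach the final print).
import signal
import subprocess
import sys
import textwrap

child_code = textwrap.dedent(
    """
    import os, threading, time

    def background():
        os.abort()  # raises SIGABRT from a non-main thread

    threading.Thread(target=background).start()
    time.sleep(60)  # never reached if abort() takes down the whole process
    print("still alive")
    """
)

proc = subprocess.run([sys.executable, "-c", child_code])
# For a child killed by a signal, returncode is the negated signal number.
assert proc.returncode == -signal.SIGABRT, proc.returncode
print("child was killed by SIGABRT, as expected")
```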
Added test