-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix NCCL test hang #9367
Fix NCCL test hang #9367
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can use the per-thread stream here as a future improvement.
* because NCCL kernels are blocking. The extra CUDA stream synchronization makes sure that the | ||
* NCCL kernels are caught up, thus avoiding the deadlock. | ||
*/ | ||
explicit NcclDeviceCommunicator(int device_ordinal, bool needs_sync = false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid the default parameter? Would be better if we have a clear reference to who's asking for sync and who is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
You mean per-thread default stream? I'm not sure it'll help here. The issue seems to be creating multiple NCCL communicators in the same process and mixing NCCL kernels with ours. |
Thank you for the quick fix! |
The NCCL communicator tests sometimes hang because NCCL kernels are blocking, and mixing with
LaunchN
calls might cause a deadlock. Synchronizing on the stream beforeLaunchN
seems to fix this issue.