Distributed training compatibility issue in ignite 0.4.2 #1307
@Nic-Ma thanks for the report! Let me check it on my side to understand where the problem is.
@Nic-Ma could you please provide more detail on your system: which PyTorch version, and how it is built with NCCL? I cannot reproduce your issue with NCCL. In my case, I use PyTorch 1.6.0 with prebuilt NCCL.
However, there can be another issue with the test: it might be missing torch.cuda.set_device("cuda:{}".format(local_rank)). Omitting this can lead to hangs on collective ops. The problem is emphasized with 0.4.2, as it implicitly relies on the fact that the user sets the cuda device per local rank. However, I'd like to improve this part of our code.
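As a minimal sketch of the fix described above (assuming the standard launcher that exports a LOCAL_RANK environment variable; `device_for_local_rank` and `init_worker` are hypothetical helper names, not ignite APIs):

```python
import os

def device_for_local_rank(local_rank: int) -> str:
    # Map a worker's local rank to its GPU device string. Each process
    # must pin itself to this device before the first NCCL collective,
    # otherwise all ranks may default to cuda:0 and hang.
    return "cuda:{}".format(local_rank)

def init_worker() -> str:
    # Hypothetical per-process setup; assumes the launcher set LOCAL_RANK.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = device_for_local_rank(local_rank)
    # In a real worker, pin the GPU *before* initializing the group:
    #   torch.cuda.set_device(device)
    #   torch.distributed.init_process_group(backend="nccl")
    return device
```

In an actual distributed script, the two commented calls would run once per process, with `torch.cuda.set_device(...)` strictly before `init_process_group`.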
Hi @vfdev-5, thanks for your quick help. After adding the torch.cuda.set_device call you suggested, it works now. Thanks.
@Nic-Ma yes, this should definitely be addressed in a clearer way! Thanks for the suggestion! I'm still wondering which PyTorch distribution you are using with NCCL 2.7.8?
Hi @vfdev-5, we used this docker image: https://ngc.nvidia.com/catalog/containers/nvidia:pytorch
Let's close this issue in favor of #1308
❓ Questions/Help/Support
Hi @vfdev-5 ,
I am trying to upgrade ignite to v0.4.2 in MONAI, and I got an error when I ran this MONAI test program:
https://github.com/Project-MONAI/MONAI/blob/master/tests/test_handler_rocauc_dist.py
I used 2 GPUs on 1 node, and the test passed with ignite v0.3.0 before.
Here is the error log:
Is something wrong with my NCCL version and ignite v0.4.2?
Thanks.