distributed training compatibility issue in ignite 0.4.2 #1307

Closed · Nic-Ma opened this issue on Sep 21, 2020 · 6 comments

Nic-Ma (Contributor) commented Sep 21, 2020

❓ Questions/Help/Support

Hi @vfdev-5 ,

I am trying to upgrade ignite to v0.4.2 in MONAI, and I got an error when I ran this MONAI test program:
https://github.com/Project-MONAI/MONAI/blob/master/tests/test_handler_rocauc_dist.py
I used 2 GPUs on 1 node, and the test passed with ignite v0.3.0 before.
Here is the error log:

root@apt-sh-ai:/workspace/data/medical/MONAI# python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="10.23.137.29" --master_port=1234 tests/test_handler_rocauc_dist.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
  File "tests/test_handler_rocauc_dist.py", line 48, in <module>
    main()
  File "tests/test_handler_rocauc_dist.py", line 23, in main
    auc_metric = ROCAUC(to_onehot_y=True, softmax=True)
  File "/workspace/data/medical/MONAI/monai/handlers/roc_auc.py", line 66, in __init__
    super().__init__(output_transform, device=device)
  File "/opt/conda/lib/python3.6/site-packages/ignite/metrics/metric.py", line 200, in __init__
    if idist.get_world_size() > 1:
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/utils.py", line 133, in get_world_size
    sync(temporary=True)
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/utils.py", line 64, in sync
    model = comp_model_cls.create_from_context()
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/native.py", line 48, in create_from_context
    return _NativeDistModel()
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/native.py", line 64, in __init__
    self._init_from_context()
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/native.py", line 97, in _init_from_context
    self._setup_attrs()
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/base.py", line 26, in _setup_attrs
    self._nproc_per_node = self._compute_nproc_per_node() if self.get_world_size() > 1 else 1
  File "/opt/conda/lib/python3.6/site-packages/ignite/distributed/comp_models/native.py", line 101, in _compute_nproc_per_node
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 938, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:558, invalid usage, NCCL version 2.7.8
(the same traceback was raised by the second process)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'tests/test_handler_rocauc_dist.py', '--local_rank=1']' returned non-zero exit status 1.

Is something wrong with my NCCL version or ignite v0.4.2?

Thanks.

vfdev-5 (Collaborator) commented Sep 21, 2020

@Nic-Ma thanks for the report! Let me check it on my side to understand where the problem is.

vfdev-5 (Collaborator) commented Sep 21, 2020

@Nic-Ma could you please provide more details on your system: the PyTorch version and how it is built with NCCL?

I cannot reproduce your issue with NCCL. In my case, I use PyTorch 1.6.0 with the prebuilt NCCL:

torch.cuda.nccl.version()
> 2408
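
For completeness, the relevant version details can be collected with a short snippet like the one below; it is just standard torch queries, nothing MONAI- or ignite-specific:

import torch

# Print the PyTorch build plus the CUDA / cuDNN / NCCL versions it ships with
print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())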

However, there can be another issue: the test might be missing

torch.cuda.set_device("cuda:{}".format(local_rank))

This can lead to hangs on collective ops like all_gather. See https://pytorch.org/docs/stable/distributed.html#launch-utility (Important Notices).

This problem is emphasized in 0.4.2 as it implicitly relies on the fact that the user sets the CUDA device per local rank. However, I'd like to improve this part of our code.
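
For illustration, a minimal per-process setup for a script started with torch.distributed.launch could look like the sketch below (the argument name follows the launcher's --local_rank convention; this is not the actual MONAI test code):

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# Bind this process to its own GPU *before* any NCCL collective runs;
# otherwise every rank may end up on cuda:0 and all_reduce/all_gather
# can fail with "invalid usage" or simply hang.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

With 2 processes per node, each rank then works on a distinct device, so the all_reduce in _compute_nproc_per_node from the traceback above should behave as expected.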

Nic-Ma (Contributor, Author) commented Sep 22, 2020

Hi @vfdev-5 ,

Thanks for your quick help; after adding torch.cuda.set_device, the issue is solved.
Maybe you could add an explicit warning when this setting is missing, since 0.4.2 implicitly relies on it.
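
Something along these lines could catch the mismatch early (purely illustrative, not ignite's actual code; warn_if_device_not_set is a hypothetical helper name):

import warnings
import torch
import torch.distributed as dist

def warn_if_device_not_set(local_rank: int) -> None:
    # Hypothetical check: warn when the current CUDA device does not match
    # the local rank, which usually means torch.cuda.set_device() was skipped.
    if torch.cuda.is_available() and dist.is_available() and dist.is_initialized():
        if torch.cuda.current_device() != local_rank:
            warnings.warn(
                f"Current CUDA device ({torch.cuda.current_device()}) does not match "
                f"local rank ({local_rank}). Call torch.cuda.set_device(local_rank) "
                "before using NCCL collectives."
            )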

Thanks.

vfdev-5 (Collaborator) commented Sep 22, 2020

@Nic-Ma yes, this should definitely be addressed in a clear way! Thanks for the suggestion!

I'm still wondering which PyTorch distribution you are using with NCCL 2.7.8?
I think it would be helpful for us to use it in our CI.

Nic-Ma (Contributor, Author) commented Sep 22, 2020

Hi @vfdev-5, we used this Docker image: https://ngc.nvidia.com/catalog/containers/nvidia:pytorch
I think it's PyTorch 1.7a.
Thanks.

vfdev-5 (Collaborator) commented Sep 22, 2020

Let's close this issue in favor of #1308

vfdev-5 closed this as completed on Sep 22, 2020