Helper function all_gather_tensors_with_shapes() #3281

Conversation

sadra-barikbin (Collaborator)

No description provided.

github-actions bot added the "module: distributed" (Distributed module) label on Sep 4, 2024
vfdev-5 changed the title from "Helper function allgather_tensors_with_defferent_shapes()" to "Helper function all_gather_tensors_with_shapes()" on Sep 4, 2024

vfdev-5 (Collaborator) left a comment:

LGTM

sadra-barikbin and others added 4 commits September 4, 2024 19:35
…-tensors-with-different-shapes' into feature-allgather-tensors-with-different-shapes
# Excerpt from ignite/distributed/utils.py (all_gather_tensors_with_shapes) under review:
if isinstance(_model, _SerialModel) or group == dist.GroupMember.NON_GROUP_MEMBER:
    # Nothing to gather when running serially or when this rank is not part of the group
    return [tensor]

# Element-wise maximum of the per-rank shapes; each tensor is padded up to this shape
max_shape = torch.tensor(shapes).amax(dim=0)

vfdev-5 (Collaborator):
I wonder whether we could actually get the tensor shapes using all_gather, so that the shapes arg can be optional?

sadra-barikbin (Collaborator, Author):
Yes we can. Do you want it in this PR?

vfdev-5 (Collaborator):

Up to you; if you would like to do it in another PR, that's OK with me as well.

vfdev-5 (Collaborator):
Let's do it in another PR; I'll merge this one since CI is green.
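
For illustration, a rough sketch of the idea discussed above (not the PR's implementation): each rank all-gathers its own shape first, so an explicit shapes argument would no longer be required. It assumes torch.distributed is already initialized and that every rank's tensor has the same number of dimensions; gather_shapes is a hypothetical helper, not part of ignite.

import torch
import torch.distributed as dist

def gather_shapes(tensor: torch.Tensor) -> list:
    # Hypothetical helper (not in ignite): all-gather each rank's shape so that
    # callers would not have to pass `shapes` explicitly.
    world_size = dist.get_world_size()
    local_shape = torch.tensor(tensor.shape, dtype=torch.long, device=tensor.device)
    gathered = [torch.empty_like(local_shape) for _ in range(world_size)]
    dist.all_gather(gathered, local_shape)  # every rank receives all shapes
    return [shape.tolist() for shape in gathered]

The gathered shapes could then be passed to all_gather_tensors_with_shapes in place of a user-supplied shapes argument.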

vfdev-5 merged commit 680ac7f into pytorch:master on Sep 5, 2024
20 checks passed

vfdev-5 (Collaborator) commented on Sep 9, 2024:

@sadra-barikbin there is a failure related to this PR on the HVD GPU CI:

[0]<stderr>:User function raise error: Padding length should be less than or equal to two times the input dimension but got padding length 6 and input of dimension 1
[0]<stderr>:Traceback (most recent call last):
[0]<stderr>:  File "<frozen runpy>", line 198, in _run_module_as_main
[0]<stderr>:  File "<frozen runpy>", line 88, in _run_code
[0]<stderr>:  File "/opt/conda/lib/python3.11/site-packages/horovod-0.28.1-py3.11-linux-x86_64.egg/horovod/runner/run_task.py", line 37, in <module>
[0]<stderr>:    main(driver_addr, run_func_server_port)
[0]<stderr>:  File "/opt/conda/lib/python3.11/site-packages/horovod-0.28.1-py3.11-linux-x86_64.egg/horovod/runner/run_task.py", line 28, in main
[0]<stderr>:    raise e
[0]<stderr>:  File "/opt/conda/lib/python3.11/site-packages/horovod-0.28.1-py3.11-linux-x86_64.egg/horovod/runner/run_task.py", line 25, in main
[0]<stderr>:    ret_val = func()
[0]<stderr>:              ^^^^^^
[0]<stderr>:  File "/opt/conda/lib/python3.11/site-packages/horovod-0.28.1-py3.11-linux-x86_64.egg/horovod/runner/__init__.py", line 215, in wrapped_func
[0]<stderr>:    return func(*args, **kwargs)
[0]<stderr>:           ^^^^^^^^^^^^^^^^^^^^^
[0]<stderr>:  File "/work/tests/ignite/conftest.py", line 370, in _hvd_task_with_init
[0]<stderr>:    func(*args)
[0]<stderr>:  File "/work/tests/ignite/distributed/utils/__init__.py", line 333, in _test_idist_all_gather_tensors_with_shapes_group
[0]<stderr>:    tensors = all_gather_tensors_with_shapes(rank_tensor, [[r + 1, r + 2, r + 3] for r in ranks], ranks)
[0]<stderr>:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[0]<stderr>:  File "/work/ignite/distributed/utils.py", line 395, in all_gather_tensors_with_shapes
[0]<stderr>:    padded_tensor = torch.nn.functional.pad(
[0]<stderr>:                    ^^^^^^^^^^^^^^^^^^^^^^^^
[0]<stderr>:  File "/opt/conda/lib/python3.11/site-packages/torch/nn/functional.py", line 4552, in pad
[0]<stderr>:    return torch._C._nn.pad(input, pad, mode, value)
[0]<stderr>:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[0]<stderr>:RuntimeError: Padding length should be less than or equal to two times the input dimension but got padding length 6 and input of dimension 1

https://github.com/pytorch/ignite/actions/runs/10769410690/job/29860560970?pr=3283

Can you please check what happens?
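
For context, the constraint the traceback hits can be reproduced standalone (illustrative snippet, not the PR's test): torch.nn.functional.pad accepts at most 2 * tensor.dim() padding values, so a 6-value pad spec applied to a 1-D tensor raises exactly this error.

import torch
import torch.nn.functional as F

x = torch.zeros(5)            # 1-D tensor: at most 2 padding values are allowed
F.pad(x, (0, 2))              # OK: pads the single dimension on the right
F.pad(x, (0, 1, 0, 2, 0, 3))  # RuntimeError: padding length 6 ... input of dimension 1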
