[C/PyTorch] Fixed incorrect use of `torch.distributed.new_group()` when creating intra-node group in `initialize_ub()` #1087

denera · 2024-08-07T19:17:19Z

Description

#994 reported that initialize_ub() creates the intra-node process group with only the ranks on that node calling into torch.distributed.new_group(). PyTorch requires every rank to call new_group(), including ranks that are not part of the new group.

This PR fixes the issue by creating the intra-node group via torch.distributed.new_subgroups_by_enumeration() instead. It also expands the unit test coverage over TE layers with comm+GEMM overlap and previously untested atomic GEMM overlaps.
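For illustration, here is a minimal sketch of the difference between the two approaches. It is not the code from this PR: the helper names `intra_node_rank_lists` and `create_intra_node_group` are hypothetical, and the sketch assumes global ranks are assigned contiguously per node and that `torch.distributed.init_process_group()` has already been called.

```python
from typing import List


def intra_node_rank_lists(world_size: int, ranks_per_node: int) -> List[List[int]]:
    """Enumerate the global ranks of every node (assumes contiguous rank placement)."""
    if world_size % ranks_per_node != 0:
        raise ValueError("world_size must be a multiple of ranks_per_node")
    return [
        list(range(start, start + ranks_per_node))
        for start in range(0, world_size, ranks_per_node)
    ]


def create_intra_node_group(ranks_per_node: int):
    """Hypothetical helper: every rank calls this after init_process_group()."""
    # Imported lazily so the pure-Python helper above can be used without torch.
    import torch.distributed as dist

    rank_lists = intra_node_rank_lists(dist.get_world_size(), ranks_per_node)

    # Problematic pattern: only the ranks of one node enter new_group(), e.g.
    #   if dist.get_rank() in my_node_ranks:
    #       group = dist.new_group(ranks=my_node_ranks)
    # new_group() has to be entered by all ranks, including non-members.

    # Every rank passes the same full enumeration of subgroups. The call returns
    # the subgroup containing the calling rank, plus the list of all subgroups.
    intra_node_group, _all_groups = dist.new_subgroups_by_enumeration(rank_lists)
    return intra_node_group
```

With 8 ranks and 4 ranks per node, `intra_node_rank_lists(8, 4)` yields one rank list per node, and each rank receives the group that contains its own node.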

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactor

Changes

Re-implemented Userbuffers intra-node process group with torch.distributed.new_subgroups_by_enumeration().
Added new unit tests for TE modules with comm+GEMM overlap.
Fixed a GEMM FP8 output type mistake in comm+GEMM overlap algorithms.
Updated existing comm+GEMM overlap tests to cover missing atomic GEMM overlaps and GEMM FP8 outputs.

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

timmoon10

Seems reasonable. It looks like some of these changes are also in #1067, so it might be better to merge that first.

denera · 2024-08-09T15:25:44Z

> Seems reasonable. It looks like some of these changes are also in #1067, so it might be better to merge that first.

#1067 needs a bit more attention to validate with MLPerf benchmarks, so I'll merge this first and then rebase 1067.

updated initialize_ub() to use new_subgroups_by_enumeration() to generate intra-node groups, added new unit tests for TE layers with comm overlap (daf2226)

Signed-off-by: Alp Dener <[email protected]>

denera added the bug Something isn't working label Aug 7, 2024

denera requested review from timmoon10, ptrendx and ksivaman August 7, 2024 19:17

denera self-assigned this Aug 7, 2024

[pre-commit.ci] auto fixes from pre-commit.com hooks

3ceed63

for more information, see https://pre-commit.ci

liuhatry mentioned this pull request Aug 8, 2024

tp_overlap init failed when tp_size != world_size #994

Closed

timmoon10 approved these changes Aug 8, 2024

denera merged commit fa4b866 into NVIDIA:main Aug 9, 2024
2 checks passed

denera mentioned this pull request Aug 16, 2024

create_communicator_grouped2 may trigger uninit value memory issue(randomly crash) when you train more iterations. #959

Closed
