Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C/PyTorch] Fixed incorrect use of torch.distributed.new_group() when creating intra-node group in initialize_ub() #1087

Merged
merged 2 commits into from
Aug 9, 2024

Conversation

denera
Copy link
Collaborator

@denera denera commented Aug 7, 2024

Description

#994 reported that initialize_ub() tries to create the intra-node process group with only the node ranks calling into torch.distributed.new_group() even though PyTorch requires all ranks to call into this even if they're not part of the new group.

This PR fixes the issue by creating the intra-node group via torch.distributed.new_subgroups_by_enumeration() instead. It also expands the unit test coverage over TE layers with comm+GEMM overlap and previously untested atomic GEMM overlaps.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refractor

Changes

  • Re-implemented Userbuffers intra-node process group with torch.distributed.new_subgroups_by_enumeration().
  • Added new unit tests for TE modules with comm+GEMM overlap.
  • Fixed a GEMM FP8 output type mistake in comm+GEMM overlap algorithms.
  • Updated existing comm+GEMM overlap tests to cover missing atomic GEMM overlaps and GEMM FP8 outputs.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

…rate intra-node groups, added new unit tests for TE layers with comm overlap

Signed-off-by: Alp Dener <[email protected]>
@denera denera added the bug Something isn't working label Aug 7, 2024
@denera denera self-assigned this Aug 7, 2024
Copy link
Collaborator

@timmoon10 timmoon10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable. It looks like some of these changes are also in #1067, so it might be better to merge that first.

@denera
Copy link
Collaborator Author

denera commented Aug 9, 2024

Seems reasonable. It looks like some of these changes are also in #1067, so it might be better to merge that first.

#1067 needs a bit more attention to validate with MLPerf benchmarks, so I'll merge this first and then rebase 1067.

@denera denera merged commit fa4b866 into NVIDIA:main Aug 9, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants