[C/PyTorch] Fixed incorrect use of torch.distributed.new_group()
when creating intra-node group in initialize_ub()
#1087
Description
#994 reported that initialize_ub() tries to create the intra-node process group with only the node-local ranks calling into torch.distributed.new_group(), even though PyTorch requires all ranks to call into this function, including ranks that are not part of the new group. This PR fixes the issue by creating the intra-node group via torch.distributed.new_subgroups_by_enumeration() instead. It also expands unit test coverage of TE layers with comm+GEMM overlap, including previously untested atomic GEMM overlaps.
Type of change
Changes
Create the intra-node process group with torch.distributed.new_subgroups_by_enumeration() instead of torch.distributed.new_group().
Checklist: