Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C/PyTorch] Userbuffers and comm+GEMM overlap algorithms refactored and moved to TE/common #1067

Open
wants to merge 34 commits into
base: main
Choose a base branch
from

Conversation

denera
Copy link
Collaborator

@denera denera commented Jul 31, 2024

Description

This PR moves Userbuffers and comm+GEMM overlap algorithms from TE/PyTorch to TE/common with refactored interfaces to remove the PyTorch dependency.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refractor

Changes

  • transformer_engine/pytorch/csrc/userbuffers moved to transformer_engine/common/comm_gemm_overlap/userbuffers.
  • transformer_engine/pytorch/csrc/comm_gemm_overlap.h split into transformer_engine/common/include/transformer_engine/comm_gemm_overlap.h and transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp and refactored to remove torch::Tensor dependency.
  • Added new TE/PyTorch wrappers around the refactored comm+GEMM overlap algorithms.
  • Expanded unit tests to cover all overlap algorithms including atomic GEMM overlaps (tested as AG+RS pairs).

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@denera denera force-pushed the comm-gemm-overlap-refactor branch from 7255ca5 to 7c0cc8d Compare July 31, 2024 18:44
@denera denera self-assigned this Jul 31, 2024
@denera denera added the enhancement New feature or request label Jul 31, 2024
transformer_engine/common/CMakeLists.txt Outdated Show resolved Hide resolved
transformer_engine/pytorch/module/layernorm_linear.py Outdated Show resolved Hide resolved
transformer_engine/common/util/pybind_helper.h Outdated Show resolved Hide resolved
@timmoon10 timmoon10 self-requested a review August 1, 2024 00:36
Copy link
Collaborator

@timmoon10 timmoon10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks pretty good. My suggestions are quibbles with the API.

@denera denera force-pushed the comm-gemm-overlap-refactor branch from 2e55bb2 to dd8cc21 Compare August 1, 2024 20:56
@denera denera force-pushed the comm-gemm-overlap-refactor branch 8 times, most recently from 4847133 to 10feff5 Compare August 28, 2024 23:14
denera and others added 23 commits October 24, 2024 17:46
…areable file handle send/recv

Signed-off-by: Alp Dener <[email protected]>
…ters so PyTorch can factor externally allocated memory into its garbage collection threshold

Signed-off-by: Alp Dener <[email protected]>
…mmOverlapHelper to simplify Python function signatures

Signed-off-by: Alp Dener <[email protected]>
…ceGetProp call with cached sm_count()

Signed-off-by: Alp Dener <[email protected]>
…ion type defines with aliases

Signed-off-by: Alp Dener <[email protected]>
Copy link
Collaborator

@timmoon10 timmoon10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We can merge once we are passing CI.

docs/api/pytorch.rst Outdated Show resolved Hide resolved
@denera
Copy link
Collaborator Author

denera commented Oct 24, 2024

/te-ci

@denera denera requested a review from ksivaman October 24, 2024 20:48
Copy link
Member

@ksivaman ksivaman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants