[PyTorch] Reorganize L1 tests #1255

Merged Oct 18, 2024 (5 commits)

Conversation

timmoon10
Collaborator

Description

Our testing scheme currently looks like:

  • L0: run in all CI pipelines and with nightly builds
  • L1: monthly builds
  • L2: ?
  • L3: ?

I propose the following:

  • L0: run in all CI pipelines
  • L1: nightly builds
  • L2: weekly builds
  • L3: monthly builds
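
The proposed schedule can be written as a small lookup table. This is only an illustrative sketch, not code from this PR: the trigger names (`ci`, `nightly`, `weekly`, `monthly`), the `levels_for` helper, and the optional `cumulative` behavior (less frequent builds also running the more frequent levels) are all assumptions made for the example.

```python
# Hypothetical mapping of test levels to the build that runs them,
# following the proposal above. Trigger names are made up for illustration.
PROPOSED_SCHEDULE = {
    "L0": "ci",       # every CI pipeline
    "L1": "nightly",
    "L2": "weekly",
    "L3": "monthly",
}


def levels_for(trigger: str, cumulative: bool = False) -> list[str]:
    """Return the test levels a given trigger would run.

    With cumulative=True, a less frequent build also runs every level
    assigned to a more frequent one. That policy is an assumption of
    this sketch, not something stated in the PR.
    """
    # Triggers ordered from most frequent to least frequent.
    order = list(PROPOSED_SCHEDULE.values())
    if trigger not in order:
        raise ValueError(f"unknown trigger: {trigger!r}")
    if cumulative:
        cutoff = order.index(trigger)
        return [
            level
            for level, trig in PROPOSED_SCHEDULE.items()
            if order.index(trig) <= cutoff
        ]
    return [level for level, trig in PROPOSED_SCHEDULE.items() if trig == trigger]


if __name__ == "__main__":
    print(levels_for("nightly"))
    print(levels_for("weekly", cumulative=True))
```

Keeping the mapping in one table means that moving a test suite between levels, as this PR does for the distributed and GPT convergence tests, does not require touching the scheduling logic.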

This PR consolidates the distributed PyTorch tests into L1 and moves the GPT convergence test to L3.

Note that we're not really following the standard testing nomenclature since most of our Python L0 tests should be classified as L1 or L2. This matches our workflow though, so I don't see a need to be pedantic.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring
  • Testing

Changes

  • Consolidate PyTorch distributed tests within single L1 test script
  • Move PyTorch GPT convergence test to L3 test script

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@timmoon10 timmoon10 added the testing Improvements to tests or testing infrastructure label Oct 15, 2024
@timmoon10 timmoon10 requested a review from pggPL October 15, 2024 17:21
@pggPL
Collaborator

pggPL commented Oct 16, 2024

LGTM

Can we add some way to run L1 tests from GitHub?

@timmoon10
Collaborator Author

Pipeline 19429427 seems to work as expected. The test failures don't seem related to this PR.

@timmoon10
Collaborator Author

Pipeline 19461229

@timmoon10 timmoon10 merged commit 41fe1e5 into NVIDIA:main Oct 18, 2024
14 checks passed
@timmoon10 timmoon10 deleted the refactor-l1-tests branch October 18, 2024 01:58
timmoon10 added a commit to timmoon10/TransformerEngine that referenced this pull request Oct 18, 2024
Forgot to remove in NVIDIA#1255.

Signed-off-by: Tim Moon <[email protected]>
timmoon10 added a commit that referenced this pull request Oct 18, 2024
Remove PyTorch L0 distributed test

Forgot to remove in #1255.

Signed-off-by: Tim Moon <[email protected]>
timmoon10 added a commit that referenced this pull request Oct 18, 2024
* Reorganize PyTorch L1 tests

Signed-off-by: Tim Moon <[email protected]>

* Move ONNX tests to L1

Signed-off-by: Tim Moon <[email protected]>

* Move FA version test to L3

Signed-off-by: Tim Moon <[email protected]>

* Limit parallel build jobs in FA version test

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>