
Enable float8 CI on sm89 #587

Merged
merged 29 commits into main on Aug 7, 2024

Conversation

jainapurva
Contributor

@jainapurva jainapurva commented Aug 1, 2024

The current CI/CD pipeline skipped the float8 tests because they aren't compatible with the A10G GPUs (sm86). This PR adds a new CI job that runs the float8 tests on NVIDIA L4 Tensor Core GPUs (sm89).
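For context, the skip condition boils down to a compute-capability check — a minimal sketch (not torchao's actual helper; the function name and values are illustrative), assuming float8 matmul kernels require sm89 or newer:

```python
def supports_float8(capability):
    """float8 (e4m3/e5m2) hardware matmul needs compute capability >= 8.9."""
    return capability >= (8, 9)

# On a real GPU the tuple would come from torch.cuda.get_device_capability().
print(supports_float8((8, 6)))  # A10G (g5 runners): tests skipped
print(supports_float8((8, 9)))  # L4 (g6 runners): tests run
```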

Fixes Issue: #575

Test Plan: Compare the logs of the Regression Test and Float8 Test jobs to verify that tests skipped in Regression Test now run in the new Float8 Test CI job.

A sample of test cases that were skipped in the CUDA Nightly Regression Tests and now run in the Float8 job:

  • test/float8/test_numerics_integration.py::TestFloat8NumericsIntegrationTest::test_encoder_fw_bw
  • test/float8/test_inference_flows.py::TestHPTrainToFP8LinearInference::test_dynamic_fp8_mlp
  • test/float8/test_inference_flows.py::TestHPTrainToFP8LinearInference::test_static_fp8_mlp

Screenshot of the CUDA Nightly logs: [image]

Screenshot of the Float8 test logs: [image]

For the full list of passed tests, see: https://gist.github.com/jainapurva/a5ddd17c809219151485de8d1708d078


pytorch-bot bot commented Aug 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/587

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 01f0ba1 with merge base 013cce3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 1, 2024
@jainapurva jainapurva marked this pull request as ready for review August 2, 2024 00:29
@msaroufim
Member

msaroufim commented Aug 5, 2024

So, a couple of questions here:

  1. Are these g6 machines cheaper than g5? Because it sounds like the machine count goes from 12 to 4. If they're cheaper, we can just use this machine by default instead and you won't need an extra job.
  2. Also, if the plan is to launch an extra job for fp8, then the new job should only run the fp8 tests; it'll make your tests run a lot faster.
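As a sketch of point 2 (hypothetical — this is not the repo's actual CI script, and the function name is made up), the new job could invoke pytest on just the float8 test directory instead of the full suite:

```python
import sys

def float8_pytest_cmd(test_dir="test/float8/"):
    # Build a pytest invocation restricted to the fp8 tests; a CI step
    # would run the result with subprocess.run(cmd, check=True).
    return [sys.executable, "-m", "pytest", test_dir, "-v"]

print(float8_pytest_cmd())
```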

EDIT: OK, that's a lot of test failures. Mind opening up a tracker with them so we can assign them to the right owners?

  1. Sparsity issues will be fixed by @jcaip
  2. Affine quantized tensor by @jerryzh168

@msaroufim
Member

msaroufim commented Aug 6, 2024

I noticed the fp8 test is not being triggered here, so instead what I did was:

  1. Copy-pasted your updated regression test file
  2. Deleted all strategies except the one that needs to run on the g6 machines
  3. Updated the test to only run fp8

So you can go ahead and pull those updates into this PR from #603; I won't land #603.

For GitHub Actions, because you can't iterate fast (you have to wait on CI), I'd recommend copy-pasting as much as possible to minimize the surface area for typos, or running tiny experiments where you temporarily delete all jobs in your branch except the one you're iterating on. It makes iterating far more pleasant.

[Screenshot: 2024-08-05 at 9:33:58 PM]

Also, regarding pricing: comparing g5.12x vs g6.4x on-demand prices (https://aws.amazon.com/ec2/instance-types/g6/ and https://aws.amazon.com/ec2/instance-types/g5/):

  • g5.12x: $5.672 per hour
  • g6.4x: $1.323 per hour

So you could reduce our CI costs by about 4x just by changing the default machine type, but we can do that in a future PR.
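The "about 4x" figure follows directly from the on-demand prices quoted above (prices vary by region and over time):

```python
g5_12x = 5.672  # $/hour, g5.12xlarge on-demand (quoted above)
g6_4x = 1.323   # $/hour, g6.4xlarge on-demand (quoted above)
ratio = g5_12x / g6_4x
print(f"{ratio:.2f}x cheaper")  # roughly 4.29x
```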

msaroufim
msaroufim previously approved these changes Aug 6, 2024
@msaroufim msaroufim dismissed their stale review August 6, 2024 18:28

fat finger

@jainapurva jainapurva force-pushed the float8-on-sm89 branch 2 times, most recently from 26371be to 0d948fb Compare August 7, 2024 17:26
@jainapurva jainapurva merged commit 1cfe69e into main Aug 7, 2024
14 checks passed