
rocblas -> hipblas changes for ROCm #5401

Merged: 9 commits merged into microsoft:master from rocblas_to_hipblas_fix on May 17, 2024

Conversation

rraminen
Contributor

@rraminen rraminen commented Apr 11, 2024

Fixes #4989

In addition to this PR, the following changes are required to successfully build these extensions:
- transformer_inference
- quantizer
- random_ltd

Required changes:
- pytorch/pytorch#121030
- #5402

Please note that not all unit tests for these extensions will pass with this PR; more details on the unit test results are below. These unit tests are skipped in CI anyway, so they will not break the CI.
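To illustrate the kind of call-site change a rocBLAS-to-hipBLAS port involves, here is a minimal sketch. The mapping table lists a few real rocBLAS/hipBLAS identifier pairs, but it is illustrative only, not the exhaustive or authoritative mapping used by this PR; the `port_line` helper is a hypothetical name introduced for this example.

```python
# Illustrative only: a few examples of the rocBLAS -> hipBLAS renames a port
# like this involves. Not an exhaustive or authoritative mapping.
ROCBLAS_TO_HIPBLAS = {
    "rocblas_handle": "hipblasHandle_t",
    "rocblas_create_handle": "hipblasCreate",
    "rocblas_destroy_handle": "hipblasDestroy",
    "rocblas_hgemm": "hipblasHgemm",
    "rocblas_gemm_ex": "hipblasGemmEx",
}

def port_line(line: str) -> str:
    """Apply the renames to one line of source, longest names first so that
    e.g. rocblas_create_handle is rewritten before rocblas_handle."""
    for old in sorted(ROCBLAS_TO_HIPBLAS, key=len, reverse=True):
        line = line.replace(old, ROCBLAS_TO_HIPBLAS[old])
    return line

print(port_line("rocblas_handle h; rocblas_create_handle(&h);"))
# -> hipblasHandle_t h; hipblasCreate(&h);
```

In practice such renames are done mechanically (e.g. via hipify-style tooling) plus manual fixes where the two APIs differ in signature.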

Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2) on MI200:

transformer_inference:
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/transformer/inference

Before this PR:
==== 674 failed, 622 skipped, 8 warnings, 1728 errors in 123.66s (0:02:03) =====

After this PR:
========== 555 failed, 983 passed, 1486 skipped, 8 warnings in 14.35s ==========

quantizer:
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/quantizer

Before this PR:
==== 244 failed, 8 warnings in 48.02s ====

After this PR:
===== 187 failed, 57 passed, 8 warnings in 14.74s ====

I could not find random_ltd related unit tests to run.
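As a quick sanity check, the per-outcome counts in the MI200 summaries above account for the same total number of collected tests before and after the change (warnings are not tests), so the improvement is a shift from failures/errors to passes rather than a change in what was collected:

```python
# Sanity-check the MI200 pytest summaries quoted above: before and after
# should account for the same total number of collected tests.
before_ti = 674 + 622 + 1728   # failed + skipped + errors
after_ti = 555 + 983 + 1486    # failed + passed + skipped
assert before_ti == after_ti == 3024

before_q = 244                 # failed
after_q = 187 + 57             # failed + passed
assert before_q == after_q == 244
print(before_ti, after_q)      # -> 3024 244
```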

@rraminen rraminen marked this pull request as draft April 11, 2024 18:16
@rraminen rraminen mentioned this pull request Apr 18, 2024
@rraminen rraminen force-pushed the rocblas_to_hipblas_fix branch 2 times, most recently from 1309094 to bef085e Compare May 7, 2024 21:24
@rraminen rraminen marked this pull request as ready for review May 7, 2024 21:24
@rraminen rraminen force-pushed the rocblas_to_hipblas_fix branch 3 times, most recently from dda4bba to d288d36 Compare May 15, 2024 17:54
@jithunnair-amd
Contributor

@rraminen Formatting checks failed with a trailing-whitespace error: https://github.com/microsoft/DeepSpeed/actions/runs/9115455800/job/25064328013?pr=5401#step:5:60

It should be a straightforward fix; could you please check?

@loadams loadams enabled auto-merge May 17, 2024 01:09
@rraminen
Contributor Author

Verified that the extensions build in the following Docker images:

rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_1.13.1
rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_2.0.1
rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2

@loadams loadams added this pull request to the merge queue May 17, 2024
Merged via the queue into microsoft:master with commit d3dd8e7 May 17, 2024
13 checks passed
github-merge-queue bot pushed a commit that referenced this pull request May 17, 2024
This PR enables building the extensions below for AMD GPUs with warp size 32.
- transformer_inference
- quantizer
- random_ltd


This PR works stand-alone for torch versions <= 2.0. For the latest versions, #5401 must be merged in addition to this PR.

Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on NAVI3x:

**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/transformer/inference

Before this PR:
===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s (0:01:09) =====

After this PR:
========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s ==========

**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/quantizer

Before this PR:
==== 244 failed, 8 warnings in 30.53s ====

After this PR:
====== 186 failed, 58 passed, 8 warnings in 8.89s ======

I could not find random_ltd related unit tests to run.
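The NAVI3x summaries in this commit message pass the same consistency check: the before and after runs account for the same number of collected tests, with failures and errors converted to passes:

```python
# Sanity-check the NAVI3x pytest summaries quoted above: before and after
# should account for the same total number of collected tests.
before_ti = 674 + 622 + 1728   # failed + skipped + errors
after_ti = 476 + 1062 + 1486   # failed + passed + skipped
assert before_ti == after_ti == 3024

before_q = 244                 # failed
after_q = 186 + 58             # failed + passed
assert before_q == after_q == 244
print(before_ti, after_q)      # -> 3024 244
```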

Fixes: 
#4753
#5474
ROCm#68

cc: @jithunnair-amd

---------

Co-authored-by: [email protected] <rraminen>
Co-authored-by: Logan Adams <[email protected]>
sfc-gh-reyazda pushed a commit to Snowflake-Labs/DeepSpeed that referenced this pull request Jun 10, 2024
(Commit message duplicates the PR description above.)
sfc-gh-reyazda pushed a commit to Snowflake-Labs/DeepSpeed that referenced this pull request Jun 10, 2024
(Commit message duplicates the warp-size-32 commit above.)
Successfully merging this pull request may close these issues.

[BUG] ROCM Build DeepSpeed Inference failed