
[Train] Add Intel HPU Support and Training Examples to Ray Train #43343

Merged 19 commits into ray-project:master on Mar 13, 2024

Conversation

kira-lin
Contributor

Why are these changes needed?

To leverage the potential of Intel's Habana Processing Units (HPUs), this PR extends Ray Train with support for Intel HPU hardware. It also adds training scripts for BERT and ResNet models, which are widely used in NLP and computer vision tasks, respectively.
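For context, a minimal sketch of what HPU training with Ray Train could look like once this support lands. This is not taken from the PR itself; it assumes the "HPU" resource name and the "hccl" backend that this PR wires into TorchConfig, and uses a toy model for illustration.

import torch
import torch.nn as nn

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer


def train_func():
    # When HPU resources are requested, each worker is assigned an HPU device.
    device = ray.train.torch.get_device()

    model = nn.Linear(10, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(32, 10, device=device)
        y = torch.randn(32, 1, device=device)
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


# One HPU per worker; "hccl" is the collective backend added for HPUs.
trainer = TorchTrainer(
    train_func,
    torch_config=TorchConfig(backend="hccl"),
    scaling_config=ScalingConfig(num_workers=2, resources_per_worker={"HPU": 1}),
)
result = trainer.fit()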

Related issue number

Replaces #42866

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@woshiyyya
Member

@kira-lin Can you address my comments in your previous PR #42866?

@kira-lin
Contributor Author

@woshiyyya I've addressed all of them; please review.

add node configuration and tutorial

Signed-off-by: Zhi Lin <[email protected]>
@anyscalesam anyscalesam added the train Ray Train Related Issue label Feb 28, 2024
@woshiyyya woshiyyya self-assigned this Feb 28, 2024
@woshiyyya woshiyyya (Member) left a comment

@kira-lin Thanks for the hard work. The PR looks much better now.

I left some comments on the examples; please take a look. It should be good to go after we polish the wording in the examples.

doc/source/train/examples/hpu/resnet.ipynb (1 review thread)
doc/source/train/examples/hpu/bert.ipynb (5 review threads, resolved)
@kira-lin
Contributor Author

@woshiyyya addressed

@woshiyyya woshiyyya (Member) left a comment

LGTM. Thanks for the contribution!

@kira-lin We will put the examples into the Community Examples section. In the future, if there is a user issue related to these examples, I'll hand it over to you.

Comment on lines 17 to 19
if HPU_PACKAGE_AVAILABLE:
import habana_frameworks.torch.core as htcore # noqa: F401
import habana_frameworks.torch.distributed.hccl as hpu_dist # noqa: F401
Contributor

Should this import logic be moved into _setup_torch_process_group and only called when backend == "hccl"? It seems like this needs to be called on the workers.
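A rough sketch of the suggested restructuring. This is illustrative only: the HPU_PACKAGE_AVAILABLE flag and the helper name _setup_torch_process_group come from the diff under review, while the simplified signature and body here are assumptions, not the code that was merged.

import importlib.util

import torch.distributed as dist

# Mirrors the availability check referenced in the snippet above.
HPU_PACKAGE_AVAILABLE = importlib.util.find_spec("habana_frameworks") is not None


def _setup_torch_process_group(backend: str, world_rank: int, world_size: int, init_method: str):
    if backend == "hccl" and HPU_PACKAGE_AVAILABLE:
        # Import the Habana torch plugins lazily, on the worker process,
        # only when the HCCL backend is actually requested.
        import habana_frameworks.torch.core as htcore  # noqa: F401
        import habana_frameworks.torch.distributed.hccl as hpu_dist  # noqa: F401

    dist.init_process_group(
        backend=backend,
        init_method=init_method,
        rank=world_rank,
        world_size=world_size,
    )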

Contributor Author

updated

Contributor

@kira-lin can you push the change to the PR?

Contributor Author

sorry, forgot 🥲 pushed

python/ray/train/torch/config.py (review thread, resolved)
Contributor

Looks like there is a merge conflict for this file, could you take a look at resolving it?

Contributor Author

resolved

doc/source/train/examples.yml (4 review threads, resolved)
Contributor

Looks like the doc build is failing. Would you be able to take a look? I can also take a look later if needed.

Contributor Author

sorry, not sure how to fix it

@matthewdeng matthewdeng merged commit 55ee2a2 into ray-project:master Mar 13, 2024
9 checks passed
@liuxsh9
Contributor

liuxsh9 commented Mar 15, 2024

It's great to see that Ray Train has successfully extended its support beyond GPUs to other acceleration hardware. Interestingly, we are also preparing to contribute support for Huawei NPUs; the adaptation process for torch_npu is very similar to that for torch_hpu.
However, we are concerned that following the HPU approach to add NPU support might make the code more complex and introduce scattered device-type and library checks.
Therefore, in terms of design, we would like to separate out the hardware-specific components, similar to the accelerator abstraction, to make it easier for Ray Train (and even RLlib) to support third-party devices.
We have already drafted a design that ensures it won't affect existing GPU and HPU usage, and it provides reference points for integrating third-party devices with APIs like prepare_model and prepare_data_loader. We would love to hear the community's opinions on this matter.
Should we open a PR to showcase the implementation (including some code migration for HPU), or would it be better to open an issue first to discuss this? @kira-lin @woshiyyya @matthewdeng
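To make the proposal concrete, here is a hypothetical sketch of the kind of accelerator abstraction being described. All names below (AcceleratorBackend, HPUBackend) are illustrative only and are not an existing Ray Train API; the actual design would live behind helpers like prepare_model and prepare_data_loader.

from abc import ABC, abstractmethod

import torch


class AcceleratorBackend(ABC):
    """Encapsulates the device-specific pieces that would otherwise be
    scattered through prepare_model / prepare_data_loader / process-group setup."""

    @abstractmethod
    def is_available(self) -> bool:
        """Whether the vendor library is installed on this worker."""

    @abstractmethod
    def communication_backend(self) -> str:
        """Collective backend name to pass to torch.distributed, e.g. 'hccl'."""

    @abstractmethod
    def get_device(self) -> torch.device:
        """Device to move models and batches to."""


class HPUBackend(AcceleratorBackend):
    def is_available(self) -> bool:
        try:
            import habana_frameworks.torch  # noqa: F401
            return True
        except ImportError:
            return False

    def communication_backend(self) -> str:
        return "hccl"

    def get_device(self) -> torch.device:
        return torch.device("hpu")

A Huawei NPU backend would then plug in the same way (torch_npu library, its collective backend, and an "npu" device) without adding new device checks to the shared code paths.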

@woshiyyya
Member

Hi @liuxsh9, you are welcome to add support for the NPU accelerator in Ray Train :) If you already have the designs, feel free to post a PR and we can discuss the implementation details.

@anyscalesam
Contributor

@angelinalg can we pick this into latest? I think it might have just missed the branch cut date, so right now the "latest" URL renders a 404.

Switching to master works fine though.

https://docs.ray.io/en/latest/train/examples/hpu/resnet.html
https://docs.ray.io/en/master/train/examples/hpu/resnet.html

@anyscalesam
Contributor

@liuxsh9 we should also have the discussion at the Core layer first, as the foundational interface to custom chipsets starts with Core, with the Ray libraries building on top of that.

Let's connect online, Xiaoshuang.

Labels
train Ray Train Related Issue
6 participants