
[Train] Add Intel HPU Support and Training Examples to Ray Train #43343

Merged 19 commits into ray-project:master on Mar 13, 2024

Conversation

kira-lin
Contributor

Why are these changes needed?

To leverage the potential of Intel's Habana Processing Units (HPUs), this PR extends Ray Train with support for Intel HPU hardware. It also adds training scripts for BERT and ResNet models, which are widely used in NLP and computer vision tasks, respectively.
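For context, a minimal sketch of what HPU training with Ray Train could look like once this support lands. This is not taken from the PR itself; it assumes the "HPU" resource name and the "hccl" backend that this PR wires into TorchConfig, and uses a toy model for illustration.

import torch
import torch.nn as nn

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer


def train_func():
    # When HPU resources are requested, each worker is assigned an HPU device.
    device = ray.train.torch.get_device()

    model = nn.Linear(10, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(32, 10, device=device)
        y = torch.randn(32, 1, device=device)
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


# One HPU per worker; "hccl" is the collective backend added for HPUs.
trainer = TorchTrainer(
    train_func,
    torch_config=TorchConfig(backend="hccl"),
    scaling_config=ScalingConfig(num_workers=2, resources_per_worker={"HPU": 1}),
)
result = trainer.fit()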

Related issue number

Replaces #42866

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@woshiyyya
Member

@kira-lin Can you address my comments in your previous PR #42866?

@kira-lin
Contributor Author

@woshiyyya I've addressed all of them; please review.

add node configuration and tutorial

Signed-off-by: Zhi Lin <[email protected]>
@anyscalesam anyscalesam added the train Ray Train Related Issue label Feb 28, 2024
@woshiyyya woshiyyya self-assigned this Feb 28, 2024
@woshiyyya woshiyyya (Member) left a comment

@kira-lin Thanks for the hard work. The PR looks much better now.

I left some comments on the examples; please take a look. It should be good to go after we polish the wording in the examples.

doc/source/train/examples/hpu/resnet.ipynb (1 review thread)
doc/source/train/examples/hpu/bert.ipynb (5 review threads, resolved)
@kira-lin
Contributor Author

@woshiyyya addressed

@woshiyyya woshiyyya (Member) left a comment

LGTM. Thanks for the contribution!

@kira-lin We will put the examples into the Community Examples section. In the future, if there is a user issue related to these examples, I'll hand it over to you.

Comment on lines 17 to 19
if HPU_PACKAGE_AVAILABLE:
import habana_frameworks.torch.core as htcore # noqa: F401
import habana_frameworks.torch.distributed.hccl as hpu_dist # noqa: F401
Contributor

Should this import logic be moved into _setup_torch_process_group and only called when backend == "hccl"? It seems like this needs to be called on the workers.
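A rough sketch of the suggested restructuring. This is illustrative only: the HPU_PACKAGE_AVAILABLE flag and the helper name _setup_torch_process_group come from the diff under review, while the simplified signature and body here are assumptions, not the code that was merged.

import importlib.util

import torch.distributed as dist

# Mirrors the availability check referenced in the snippet above.
HPU_PACKAGE_AVAILABLE = importlib.util.find_spec("habana_frameworks") is not None


def _setup_torch_process_group(backend: str, world_rank: int, world_size: int, init_method: str):
    if backend == "hccl" and HPU_PACKAGE_AVAILABLE:
        # Import the Habana torch plugins lazily, on the worker process,
        # only when the HCCL backend is actually requested.
        import habana_frameworks.torch.core as htcore  # noqa: F401
        import habana_frameworks.torch.distributed.hccl as hpu_dist  # noqa: F401

    dist.init_process_group(
        backend=backend,
        init_method=init_method,
        rank=world_rank,
        world_size=world_size,
    )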

Contributor Author

updated

Contributor

@kira-lin can you push the change to the PR?

Contributor Author

sorry, forgot 🥲 pushed

python/ray/train/torch/config.py (review thread, resolved)
Contributor

Looks like there is a merge conflict for this file, could you take a look at resolving it?

Contributor Author

resolved

doc/source/train/examples.yml (4 review threads, resolved)
Contributor

Looks like the doc build is failing. Would you be able to take a look? I can also take a look later if needed.

Contributor Author

sorry, not sure how to fix it

@matthewdeng matthewdeng merged commit 55ee2a2 into ray-project:master Mar 13, 2024
9 checks passed
@liuxsh9
Contributor

liuxsh9 commented Mar 15, 2024

It's great to see that Ray Train has successfully extended its support beyond GPUs to other acceleration hardware. Interestingly, we are also preparing to contribute support for Huawei NPUs; the adaptation process for torch_npu is very similar to that for torch_hpu.
However, we are concerned that following the HPU approach to add NPU support might make the code more complex and introduce scattered device-type and library checks.
Therefore, in terms of design, we would like to separate out the hardware-specific components, similar to the accelerator abstraction, to make it easier for Ray Train (and even RLlib) to support third-party devices.
We have already drafted a design that ensures it won't affect existing GPU and HPU usage, and it provides reference points for integrating third-party devices with APIs like prepare_model and prepare_data_loader. We would love to hear the community's opinions on this matter.
Should we open a PR to showcase the implementation (including some code migration for HPU), or would it be better to open an issue first to discuss this? @kira-lin @woshiyyya @matthewdeng
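To make the proposal concrete, here is a hypothetical sketch of the kind of accelerator abstraction being described. All names below (AcceleratorBackend, HPUBackend) are illustrative only and are not an existing Ray Train API; the actual design would live behind helpers like prepare_model and prepare_data_loader.

from abc import ABC, abstractmethod

import torch


class AcceleratorBackend(ABC):
    """Encapsulates the device-specific pieces that would otherwise be
    scattered through prepare_model / prepare_data_loader / process-group setup."""

    @abstractmethod
    def is_available(self) -> bool:
        """Whether the vendor library is installed on this worker."""

    @abstractmethod
    def communication_backend(self) -> str:
        """Collective backend name to pass to torch.distributed, e.g. 'hccl'."""

    @abstractmethod
    def get_device(self) -> torch.device:
        """Device to move models and batches to."""


class HPUBackend(AcceleratorBackend):
    def is_available(self) -> bool:
        try:
            import habana_frameworks.torch  # noqa: F401
            return True
        except ImportError:
            return False

    def communication_backend(self) -> str:
        return "hccl"

    def get_device(self) -> torch.device:
        return torch.device("hpu")

A Huawei NPU backend would then plug in the same way (torch_npu library, its collective backend, and an "npu" device) without adding new device checks to the shared code paths.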

@woshiyyya
Member

Hi @liuxsh9, you are welcome to add support for the NPU accelerator in Ray Train :) If you already have the designs, feel free to post a PR and we can discuss the implementation details.

@anyscalesam
Contributor

@angelinalg can we pick this into latest? I think it might have just missed the branch cut date, so right now the "latest" URL renders a 404.

Switching to master works fine though.

https://docs.ray.io/en/latest/train/examples/hpu/resnet.html
https://docs.ray.io/en/master/train/examples/hpu/resnet.html

@anyscalesam
Contributor

@liuxsh9 we should also have the discussion at the Core layer first, as the foundational interface to custom chipsets starts with Core, with the Ray libraries building on top of that.

Let's connect online, Xiaoshuang.

Labels
train Ray Train Related Issue
6 participants