[Train] Add Intel HPU Support and Training Examples to Ray Train #43343
Conversation
Signed-off-by: woshiyyya <[email protected]>
@woshiyyya I've addressed all of them, please review.
add node configuration and tutorial Signed-off-by: Zhi Lin <[email protected]>
…mple Signed-off-by: Zhi Lin <[email protected]>
Signed-off-by: woshiyyya <[email protected]>
…_train_hpu_example
@kira-lin Thanks for the hard work. The PR looks much better now.
I left some comments on the example; please take a look. It should be good to go after we polish the wording in the examples.
@woshiyyya addressed
Signed-off-by: Zhi Lin <[email protected]>
Signed-off-by: Zhi Lin <[email protected]>
LGTM. Thanks for the contribution!
@kira-lin We will put the examples into the Community Examples section. In the future, if there is a user issue related to these examples, I'll hand it over to you.
Signed-off-by: woshiyyya <[email protected]>
python/ray/train/torch/config.py
Outdated
if HPU_PACKAGE_AVAILABLE:
    import habana_frameworks.torch.core as htcore  # noqa: F401
    import habana_frameworks.torch.distributed.hccl as hpu_dist  # noqa: F401
Should this import logic be moved into _setup_torch_process_group and only run when backend == "hccl"? It seems like this needs to be called on the workers.
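For illustration, a minimal sketch of the restructuring being suggested here; the real helper in config.py may take different parameters, so treat the signature below as an assumption rather than the actual diff:

import torch.distributed as dist

def _setup_torch_process_group(backend: str, world_rank: int, world_size: int, init_method: str):
    if backend == "hccl":
        # Importing these modules on each worker registers the "hccl"
        # backend with torch.distributed; the imports are otherwise unused.
        import habana_frameworks.torch.core  # noqa: F401
        import habana_frameworks.torch.distributed.hccl  # noqa: F401
    dist.init_process_group(
        backend=backend,
        init_method=init_method,
        rank=world_rank,
        world_size=world_size,
    )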
updated
@kira-lin can you push the change to the PR?
sorry, forgot 🥲 pushed
doc/source/train/examples.rst
Outdated
Looks like there is a merge conflict for this file. Could you take a look at resolving it?
resolved
Co-authored-by: matthewdeng <[email protected]> Signed-off-by: Zhi Lin <[email protected]>
Signed-off-by: Zhi Lin <[email protected]>
…mple Signed-off-by: Zhi Lin <[email protected]>
Signed-off-by: Zhi Lin <[email protected]>
Co-authored-by: Yunxuan Xiao <[email protected]> Signed-off-by: Zhi Lin <[email protected]>
Looks like the doc build is failing. Would you be able to take a look? I can also take a look later if needed.
sorry, not sure how to fix it
It's great to see that Ray Train has successfully extended its support beyond GPUs to other acceleration hardware. Interestingly, we are also preparing to contribute support for Huawei NPUs. The adaptation process for torch_npu is very similar to torch_hpu.
Hi @liuxsh9, you are welcome to add support for NPU accelerators in Ray Train :) If you already have the designs, feel free to post a PR and we can discuss the implementation details.
@angelinalg can we pick this into latest? I think it might have just missed the branch cut date, so right now the "latest" URL renders a 404. Switching to master works fine, though. https://docs.ray.io/en/latest/train/examples/hpu/resnet.html
@liuxsh9 we should also have the discussion at the Core layer first, as the foundational interface to custom chipsets starts with Core, with Ray Libs building on top of that. Let's connect online, Xiaoshuang.
Why are these changes needed?
To leverage the potential of Intel's Habana Processing Units (HPUs), we extend Ray Train's capabilities by adding support for Intel HPU hardware. This update also includes training scripts for BERT and ResNet models, which are widely used in NLP and computer vision tasks, respectively.
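For context, a minimal sketch of how a training job could target HPUs with this support. train_func is a placeholder, and the "hccl" backend plus the "HPU" resource key follow the usage shown in the new examples; exact details may differ from the merged code:

from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

def train_func():
    # Per-worker training loop: with HPU support, models and tensors
    # can be moved to the "hpu" device and collectives run over HCCL.
    ...

trainer = TorchTrainer(
    train_func,
    # Use Habana's HCCL collective-communication backend.
    torch_config=TorchConfig(backend="hccl"),
    # Schedule one HPU per worker via Ray's custom resources.
    scaling_config=ScalingConfig(num_workers=2, resources_per_worker={"HPU": 1}),
)
result = trainer.fit()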
Related issue number
Replaces #42866
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.