[Train] Decouple device-related modules and add Huawei NPU support to Ray Train #44086
Conversation
Force-pushed from c72e9cc to e9a7311.
This is our plan to enhance support for third-party devices in Ray Train, which will also contribute to expanding device compatibility in RLlib. Looking forward to your feedback, @woshiyyya. We kindly invite developers working with HPU, NPU, and AMD GPUs to stay informed and engaged with these updates. @kira-lin @matthewdeng @vickytsang @nemo9cby @Bye-legumes
Thanks for the contribution! We'll review it soon :)
Hi @liuxsh9 , thanks for the contribution! Left some comments.
I personally like the idea of abstracting out the accelerator concept, which will be easier to generalize to other new accelerator types.
Force-pushed from 1e9fc35 to dce7e54.
Hi @woshiyyya, may I ask if the modifications made above have met your expectations? If you have any other concerns, we would be happy to provide further information.
@liuxsh9 Sure! Will take another look these days!
@woshiyyya can you follow up? I'm marking it as p2 for now @liuxsh9 as I'm not seeing any users blocked on this; do tell me if I'm wrong though and we can juggle priority here.
Hi @anyscalesam @liuxsh9 , This PR involves major changes to Ray Train device management, which needs to be fully tested to ensure stability. We will discuss with the Train team for the next steps and give an update soon.
@woshiyyya This looks like a nice accelerator abstraction, and we plan to refactor the Intel GPU backend as well.
Hi @liuxsh9 , left some comments. Here are some suggestions:
- Instead of defining all of the methods in `DeviceManager` as global static methods, let's make some of them instance methods (e.g. `get_devices`, `is_device_available`), and initialize a `DeviceManager` object for each worker.
- Specify the allocated device object while initializing the `DeviceManager` object.
- Explicitly choose the `DeviceManager` type, instead of determining it from the available resources on the current worker.

Any thoughts?
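A minimal sketch of the instance-based design suggested above; the class names mirror those discussed in this PR, but the method signatures and the CUDA example are illustrative assumptions, not the merged API:

```python
from typing import List, Optional

import torch


class TorchDeviceManager:
    """Per-worker manager for one accelerator type (illustrative sketch)."""

    def __init__(self, device: Optional[torch.device] = None):
        # The allocated device is passed in explicitly when the worker
        # constructs its manager, instead of being discovered globally.
        self.device = device

    def is_device_available(self) -> bool:
        raise NotImplementedError

    def get_devices(self) -> List[torch.device]:
        raise NotImplementedError


class CUDATorchDeviceManager(TorchDeviceManager):
    def is_device_available(self) -> bool:
        return torch.cuda.is_available()

    def get_devices(self) -> List[torch.device]:
        # Prefer the device explicitly assigned to this worker; otherwise
        # fall back to enumerating all visible CUDA devices.
        if self.device is not None:
            return [self.device]
        return [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]


# Each worker explicitly chooses its manager type for the device it was
# allocated, rather than inferring it from whatever resources are visible.
manager = CUDATorchDeviceManager(device=torch.device("cuda:0"))
print(manager.is_device_available(), manager.get_devices())
```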
…ay Train. Signed-off-by: liuxsh9 <[email protected]>
Force-pushed from c70fae0 to ebcccca.
accident -- some changes are still required
1. Replace hard-coded strings with constants from ray_constants. 2. Adjust the timing of registering the torch accelerator module. 3. Adjust the test cases. Signed-off-by: liuxsh9 <[email protected]>
…RCH_DEVICE_MANAGER_CLS` to `CPUTorchDeviceManager` Signed-off-by: liuxsh9 <[email protected]>
Signed-off-by: liuxsh9 <[email protected]>
Signed-off-by: liuxsh9 <[email protected]>
Signed-off-by: liuxsh9 <[email protected]>
Thanks! Just a few nits here -- I will run some GPU release tests and then we can merge.
(Two resolved review threads on python/ray/air/_internal/device_manager/torch_device_manager.py, now outdated.)
from ray.air._internal.device_manager.torch_device_manager import TorchDeviceManager


@lru_cache()
How does this play with multi-node settings? Does this lru_cache var get shipped over to other nodes?
For example, a CPU head node would possibly have this as false, but we don't want to carry that over to the worker nodes.
Thank you for the reminder; the cached var won't be shipped over to other nodes. But even so, it has been removed to avoid unnecessary code complexity.
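For context, a hedged sketch of the kind of cached check under discussion. The key point is that an `@lru_cache` result lives only in the process that computed it; since Ray serializes functions by reference, a value cached on a CPU-only head node is not shipped to the worker nodes (the function body below is illustrative):

```python
from functools import lru_cache

import torch


@lru_cache()
def cuda_is_available() -> bool:
    # Cached per process: each Ray worker process evaluates this on its first
    # call, so a False result cached on a CPU-only head node is never reused
    # on a GPU worker node.
    return torch.cuda.is_available()
```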
@@ -16,6 +17,7 @@
if TYPE_CHECKING:
    from ray.data.preprocessor import Preprocessor

try_register_torch_accelerator_module()
can we remove this?
Yes, it has been removed.
1. Fix nits. 2. Raise a clear runtime error when an accelerator is allocated but unavailable. 3. Remove redundant module registrations. Signed-off-by: liuxsh9 <[email protected]>
Signed-off-by: liuxsh9 <[email protected]>
# reset device is needed for npu in a new thread so far.
if device.type == "npu":
    self.device_manager.set_device(device)
Why are we running in a new thread here? Can we remove this?
This logic probably has something to do with the test failure.
> Why are we running in a new thread here? Can we remove this?

It has been removed.

> This logic probably has something to do with the test failure.

The test called `get_torch_device_manager` locally (not inside a `@remote` task), where there is no `RuntimeContext`. Fixed it by getting the device type directly via `device.type` in `_WrappedDataLoader`.
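A hedged illustration of the described fix: branch on `device.type` directly instead of constructing a device manager outside a Ray worker. The wrapper below is a simplified stand-in, not the real `_WrappedDataLoader`:

```python
import torch


def move_batch_to_device(batch: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Checking the device type string requires no RuntimeContext or device
    # manager lookup, so it also works when called outside a Ray remote task.
    if device.type == "npu":
        # NPU-specific handling (e.g. a non-blocking host-to-device copy)
        # would go here.
        return batch.to(device, non_blocking=True)
    return batch.to(device)
```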
@matthewdeng WDYT?
Signed-off-by: liuxsh9 <[email protected]>
Signed-off-by: liuxsh9 <[email protected]>
Signed-off-by: liuxsh9 <[email protected]>
Signed-off-by: liuxsh9 <[email protected]>
Great job, LGTM! Thanks for the patience in this effort! 🚢 🚀
Starting a release test sanity check here: https://buildkite.com/ray-project/release/builds/21289
def register_custom_torch_dist_backend(backend: Optional[str] = None) -> None:
    if backend == "hccl":
        # The name for the communication backend of Habana and torch-npu is the same.
        HPUTorchDeviceManager.register_custom_torch_dist_backend()

        NPUTorchDeviceManager.register_custom_torch_dist_backend()


def get_torch_device_manager_cls_by_resources(
Any reason not to define these functions in `torch_device_manager.py` instead?
Currently these methods depend on the various `XPUDeviceManager` implementations, and those implementations depend on `TorchDeviceManager` in `torch_device_manager.py`. If we put these methods into `torch_device_manager.py`, it may lead to circular import issues.
We followed the design of `AcceleratorManager`, which puts the higher-level API functions in the `__init__.py`.
My understanding may be limited, could you please provide more information?
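A layout sketch of the structure described above, using stand-in module names: the concrete managers import only the base class, while the package `__init__.py` hosts the higher-level dispatch helpers, so no circular import arises. File paths and contents here are illustrative assumptions, not the actual Ray modules:

```python
# device_manager/torch_device_manager.py -- base class only, no subclass imports.
class TorchDeviceManager:
    """Base interface shared by all backends."""


# device_manager/npu.py -- concrete backend, depends only on the base module:
# from device_manager.torch_device_manager import TorchDeviceManager
class NPUTorchDeviceManager(TorchDeviceManager):
    """Ascend NPU backend (stub)."""


# device_manager/__init__.py -- higher-level API, imports the concrete backends:
# from device_manager.torch_device_manager import TorchDeviceManager
# from device_manager.npu import NPUTorchDeviceManager
def get_torch_device_manager_cls_by_resources(resources: dict) -> type:
    # Dispatch lives at the package level, so torch_device_manager.py never
    # needs to import its own subclasses.
    return NPUTorchDeviceManager if resources.get("NPU", 0) > 0 else TorchDeviceManager
```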
_torch_device_manager = None
_torch_device_manager_lock = threading.Lock()
Why do we need a global variable to track this? Can this not just be tracked as an instance variable by the caller?
One issue with global variables is that we want these functions to be called by the individual workers, and since each worker is a separate process, they won't be pointing to these references.
This addresses the previous reviewer's feedback. Additionally, we will check for `_torch_device_manager` within the worker and initialize it if it doesn't exist. Does this meet the requirements?
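A minimal sketch of the per-worker lazy initialization described here; the module-level names mirror the diff, while the factory argument is a placeholder for the real selection logic:

```python
import threading

_torch_device_manager = None
_torch_device_manager_lock = threading.Lock()


def _get_or_create_torch_device_manager(factory):
    """Create the device manager lazily in whichever worker process first asks for it."""
    global _torch_device_manager
    with _torch_device_manager_lock:
        if _torch_device_manager is None:
            # Each Ray worker is a separate process, so this global is private
            # to that worker and gets initialized on first use there.
            _torch_device_manager = factory()
        return _torch_device_manager
```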
_torch_device_manager_lock = threading.Lock()


def get_torch_device_manager(device_type: Optional[str] = None) -> TorchDeviceManager:
IMO we should consider removing this function. It's convenient to have a single function like this, but for all the usages it seems we want one explicit path (either with or without a device), so just calling that explicit logic directly is easier to follow.
Good suggestion. This has been modified to provide two clear function entry points.
1. Clearly categorize the methods for obtaining the device manager into two categories. 2. Lazily instantiate the device manager on the first call to get_torch_device_manager. Signed-off-by: liuxsh9 <[email protected]>
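A hedged sketch of the two explicit entry points that the commit above describes. The function names approximate those in the diff; the stub classes, registry, and selection logic are simplified assumptions:

```python
import torch


class CPUTorchDeviceManager:
    """Stub standing in for the real CPU manager."""


class CUDATorchDeviceManager:
    """Stub standing in for the real CUDA manager."""


_MANAGER_BY_TYPE = {"cpu": CPUTorchDeviceManager, "cuda": CUDATorchDeviceManager}


def get_torch_device_manager_by_device_type(device_type: str):
    # Explicit path: the caller already knows which device it holds.
    try:
        return _MANAGER_BY_TYPE[device_type]()
    except KeyError:
        raise ValueError(f"Unsupported device type: {device_type}")


def get_torch_device_manager_by_context():
    # Context path: used inside a training worker, where the allocated
    # accelerator is known; CUDA availability stands in for the real
    # resource lookup here.
    device_type = "cuda" if torch.cuda.is_available() else "cpu"
    return get_torch_device_manager_by_device_type(device_type)
```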
Signed-off-by: matthewdeng <[email protected]>
thanks!
Signed-off-by: liuxsh9 <[email protected]>
Hi @woshiyyya , I think I've addressed all the concerns you raised in the previous review. However, the PR is blocked on your approval. Could you please take another look and let me know if everything looks good now?
Approved. @liuxsh9 Thanks for the contribution!
… Ray Train (ray-project#44086) We are looking to expand the hardware support range of Ray Train by incorporating Huawei Ascend NPU support. However, as the number of hardware types increases, scattered and device-specific modifications have been made to the code, which can impact future compatibility and maintainability. To address this, we have extracted the device-related modules from Ray Train and consolidated them into the `accelerator_utils`. This allows for greater independence among the device-specific code, resulting in improved maintainability. Signed-off-by: liuxsh9 <[email protected]> Signed-off-by: matthewdeng <[email protected]> Co-authored-by: matthewdeng <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Why are these changes needed?
We are looking to expand the hardware support range of Ray Train by incorporating Huawei Ascend NPU support.
However, as the number of hardware types increases, scattered and device-specific modifications have been made to the code, which can impact future compatibility and maintainability.
To address this, we have extracted the device-related modules from Ray Train and consolidated them into the `accelerator_utils`. This allows for greater independence among the device-specific code, resulting in improved maintainability.
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've introduced a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.