[Train] Split overloaded `ray.train.torch.get_device` into another `get_devices` API for multi-GPU worker setup #42314
Conversation
Signed-off-by: woshiyyya <[email protected]>
One more round of suggestions for docs, then I think it's good.
- Add a small subsection called "Assigning multiple GPUs to a worker" that shows a small tested example recommending `get_devices` (a rough sketch follows below).
- Add `get_devices` to the API reference here: https://anyscale-ray--42314.com.readthedocs.build/en/42314/train/api/api.html#pytorch
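A rough sketch of what such a docs example could look like; the trainer configuration and resource numbers here are illustrative assumptions, not taken from the PR:

```python
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # With more than one GPU assigned to this worker, get_devices()
    # returns every torch.device the worker owns.
    devices = ray.train.torch.get_devices()
    print(devices)  # e.g. [device(type='cuda', index=0), device(type='cuda', index=1)]


trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=2,
        use_gpu=True,
        resources_per_worker={"GPU": 2},  # assign 2 GPUs to each worker
    ),
)
trainer.fit()
```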
from ray.air._internal import torch_utils
record_extra_usage_tag(TagKey.TRAIN_TORCH_GET_DEVICE, "1")
This should have a new `TRAIN_TORCH_GET_DEVICES` key.
Ah, previously Justin mentioned we could use a single key for these two APIs. But now I think it makes more sense to have a separate one to get more accurate telemetry data.
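A minimal sketch of what a separate tag for the new API could look like, assuming the same `record_extra_usage_tag` pattern as the existing `get_device` call shown in the diff above; the `TRAIN_TORCH_GET_DEVICES` key is the reviewer's proposal, not an existing key:

```python
from ray._private.usage.usage_lib import TagKey, record_extra_usage_tag


def get_devices():
    # Record a dedicated telemetry tag so usage of get_devices() can be
    # tracked separately from get_device(). TRAIN_TORCH_GET_DEVICES is the
    # key name proposed in this review thread.
    record_extra_usage_tag(TagKey.TRAIN_TORCH_GET_DEVICES, "1")
    ...
```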
@@ -63,11 +69,64 @@ def get_device() -> Union[torch.device, List[torch.device]]:
>>> # ray.get_gpu_ids() == [4,5]
>>> # torch.cuda.is_available() == True
>>> # get_device() == torch.device("cuda:4")
Nice, this wasn't actually working as expected before.
Co-authored-by: matthewdeng <[email protected]>
Signed-off-by: Yunxuan Xiao <[email protected]>
@ArturNiederfahrenhorst Can you take a look? The only change to RLlib is in rllib/core/learner/torch/torch_learner.py
Yep, sorry!
Approved for RLlib changes
Very nice!
For usage.proto
Why are these changes needed?
The original `ray.train.torch.get_device` behaves inconsistently depending on the number of devices assigned to a Ray Train worker: it returns a single `torch.device` when the worker has one GPU, but a `List[torch.device]` when the worker has multiple GPUs.
The proposal involves two key changes:
- `ray.train.torch.get_device()`: always returns a single device.
- `ray.train.torch.get_devices()`: a new API that returns the list of all devices assigned to the worker.
Example (both cases are sketched below):
- Single-GPU workers
- Multi-GPU workers (e.g. 2 GPUs per worker)
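A sketch of the intended behavior inside a training function, reconstructed from the description above; the device indices are illustrative, and which single device `get_device()` picks on a multi-GPU worker is my reading of the PR, not quoted from it:

```python
import ray.train.torch


def train_func():
    # Single-GPU worker (1 GPU assigned):
    #   get_device()  -> torch.device("cuda:0")
    #   get_devices() -> [torch.device("cuda:0")]
    #
    # Multi-GPU worker (e.g. 2 GPUs assigned):
    #   get_device()  -> torch.device("cuda:0")   # a single device
    #   get_devices() -> [torch.device("cuda:0"), torch.device("cuda:1")]
    device = ray.train.torch.get_device()    # always a single torch.device
    devices = ray.train.torch.get_devices()  # always a list of torch.device
```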
Related issue number
Closes #42003, #38115
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.