[Train] Split overloaded `ray.train.torch.get_device` into another `get_devices` API for multi-GPU worker setup #42314
Conversation
Signed-off-by: woshiyyya <[email protected]>
One more round of suggestions for docs, then I think it's good.
- Add a small subsection called "Assigning multiple GPUs to a worker" that shows a small tested example recommending `get_devices` (a rough sketch follows below).
- Add `get_devices` to the API reference here: https://anyscale-ray--42314.com.readthedocs.build/en/42314/train/api/api.html#pytorch
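A rough sketch of what such a docs example could look like; the trainer configuration and resource numbers here are illustrative assumptions, not taken from the PR:

```python
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # With more than one GPU assigned to this worker, get_devices()
    # returns every torch.device the worker owns.
    devices = ray.train.torch.get_devices()
    print(devices)  # e.g. [device(type='cuda', index=0), device(type='cuda', index=1)]


trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=2,
        use_gpu=True,
        resources_per_worker={"GPU": 2},  # assign 2 GPUs to each worker
    ),
)
trainer.fit()
```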
from ray.air._internal import torch_utils
record_extra_usage_tag(TagKey.TRAIN_TORCH_GET_DEVICE, "1")
This should have a new `TRAIN_TORCH_GET_DEVICES` key.
Ah, previously Justin mentioned we could use a single key for these two APIs. But now I think it makes more sense to have a separate one to get more accurate telemetry data.
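A minimal sketch of what a separate tag for the new API could look like, assuming the same `record_extra_usage_tag` pattern as the existing `get_device` call shown in the diff above; the `TRAIN_TORCH_GET_DEVICES` key is the reviewer's proposal, not an existing key:

```python
from ray._private.usage.usage_lib import TagKey, record_extra_usage_tag


def get_devices():
    # Record a dedicated telemetry tag so usage of get_devices() can be
    # tracked separately from get_device(). TRAIN_TORCH_GET_DEVICES is the
    # key name proposed in this review thread.
    record_extra_usage_tag(TagKey.TRAIN_TORCH_GET_DEVICES, "1")
    ...
```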
@@ -63,11 +69,64 @@ def get_device() -> Union[torch.device, List[torch.device]]:
>>> # ray.get_gpu_ids() == [4,5]
>>> # torch.cuda.is_available() == True
>>> # get_device() == torch.device("cuda:4")
Nice, this wasn't actually working as expected before.
Co-authored-by: matthewdeng <[email protected]>
Signed-off-by: Yunxuan Xiao <[email protected]>
@ArturNiederfahrenhorst Can you take a look? The only change to RLlib is in rllib/core/learner/torch/torch_learner.py
Yep, sorry!
Approved for RLlib changes
Very nice!
For usage.proto
Why are these changes needed?
The original `ray.train.torch.get_device` behaves inconsistently depending on the number of devices assigned to a Ray Train worker: it returns a single `torch.device` when the worker has one GPU, but a `List[torch.device]` when the worker has multiple GPUs.
The proposal involves two key changes:
- `ray.train.torch.get_device()`: always returns a single device.
- `ray.train.torch.get_devices()`: a new API that returns the list of all devices assigned to the worker.
Example (both cases are sketched below):
- Single-GPU workers
- Multi-GPU workers (e.g. 2 GPUs per worker)
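A sketch of the intended behavior inside a training function, reconstructed from the description above; the device indices are illustrative, and which single device `get_device()` picks on a multi-GPU worker is my reading of the PR, not quoted from it:

```python
import ray.train.torch


def train_func():
    # Single-GPU worker (1 GPU assigned):
    #   get_device()  -> torch.device("cuda:0")
    #   get_devices() -> [torch.device("cuda:0")]
    #
    # Multi-GPU worker (e.g. 2 GPUs assigned):
    #   get_device()  -> torch.device("cuda:0")   # a single device
    #   get_devices() -> [torch.device("cuda:0"), torch.device("cuda:1")]
    device = ray.train.torch.get_device()    # always a single torch.device
    devices = ray.train.torch.get_devices()  # always a list of torch.device
```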
Related issue number
Closes #42003, #38115
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.