
[AIR] Don't add Trainer resources when running on Colab #28822

Merged: 3 commits into ray-project:master on Sep 28, 2022

Conversation

@amogkam (Contributor) commented Sep 27, 2022

Signed-off-by: Amog Kamsetty <[email protected]>

Google Colab only has 2 CPUs. Because of this resource scarcity, we have to be careful about where these resources are allocated, and users need to be hyper-aware of all the tasks and actors that reserve resources.

In AIR, the Trainer reserves 1 CPU by default, which is unintuitive for users. As a stopgap solution, we special-case running on Google Colab so that the Trainer does not reserve any resources and num_workers=2 works for data-parallel training. Since Google Colab is not distributed, the scalability concerns with doing this do not apply.

This has been a headache for me when running AIR on Google Colab, and for users as well: https://discuss.ray.io/t/ray-trainer-looking-for-more-cpus-than-that-of-its-initialized-on/7696
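For context, the manual workaround that this change automates looks roughly like the sketch below. It assumes the Ray AIR ScalingConfig (which accepts a trainer_resources dict) and uses the presence of the google.colab package as a Colab-detection heuristic; the heuristic and the _running_on_colab helper are illustrative, not the exact check used in this PR.

```python
# Illustrative sketch (not the exact change in this PR): reserve no CPU for the
# Trainer coordinator on Colab so both of Colab's 2 CPUs go to the workers.
import importlib.util

from ray.air.config import ScalingConfig


def _running_on_colab() -> bool:
    # Heuristic (assumption): the google.colab package only exists on Colab runtimes.
    return importlib.util.find_spec("google.colab") is not None


scaling_config = ScalingConfig(
    num_workers=2,  # data-parallel training with 2 workers, 1 CPU each
    # Zero out the coordinator's CPU reservation on Colab; keep the default elsewhere.
    trainer_resources={"CPU": 0} if _running_on_colab() else None,
)
```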

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Amog Kamsetty <[email protected]>
Signed-off-by: Amog Kamsetty <[email protected]>
@Yard1 (Member) commented Sep 27, 2022

Hmm, I wonder if it would make sense to extend this to any single-node cluster case. Not sure we have a foolproof way of determining that, though, if we consider autoscaling. This seems fine as a stopgap if there is no straightforward way of generalizing it.

@amogkam (Contributor, Author) commented Sep 27, 2022

Yeah, I don't think Ray provides good abstractions for figuring out cluster size and autoscaling behavior. Let's prioritize Colab for now since it has the most resource scarcity (only 2 CPUs) compared to other instances, which have more like 4-16.

We can cover all single-node cases in a follow-up.
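As a rough illustration of what such a follow-up might check (an assumption on my part, not code from this PR), one point-in-time way to detect a single-node cluster is to count alive nodes via the public ray.nodes() API. As noted above, this says nothing about whether the autoscaler could add nodes later.

```python
# Hypothetical sketch: detect a single-node Ray cluster at a point in time.
# This does not account for autoscaling, which is the caveat discussed above.
import ray

ray.init(ignore_reinit_error=True)

alive_nodes = [node for node in ray.nodes() if node.get("Alive")]
is_single_node = len(alive_nodes) == 1
print(f"Single-node cluster: {is_single_node}")
```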

@amogkam (Contributor, Author) commented Sep 27, 2022

Most people try out the AIR examples either on Colab or on their own laptop. Laptops usually have more CPUs (around 16), so this is less of a problem there. However, we can't have users churn because basic workflows don't work on Colab.

@Yard1 (Member) commented Sep 27, 2022

This all makes sense, let's follow up later

@amogkam amogkam merged commit fa3200f into ray-project:master Sep 28, 2022
@amogkam amogkam deleted the air-no-trainer-resources-colab branch September 28, 2022 21:32
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
[AIR] Don't add Trainer resources when running on Colab (#28822)

Signed-off-by: Amog Kamsetty <[email protected]>

Signed-off-by: Weichen Xu <[email protected]>