
[AIR] Don't add Trainer resources when running on Colab #28822

Merged: 3 commits into ray-project:master on Sep 28, 2022

Conversation

@amogkam (Contributor) commented Sep 27, 2022

Signed-off-by: Amog Kamsetty <[email protected]>

Google Colab only has 2 CPUs. Because of this resource scarcity, we have to be careful about where these resources are allocated, and users need to be hyper-aware of all the tasks and actors that reserve resources.

In AIR, the Trainer reserves 1 CPU by default, which is unintuitive for users. As a stopgap solution, we special-case running on Google Colab so that the Trainer does not reserve any resources and num_workers=2 works for data-parallel training. Since Google Colab is not distributed, the scalability concerns with doing this do not apply.

This has been a headache for me when running AIR on Google Colab, and for users as well: https://discuss.ray.io/t/ray-trainer-looking-for-more-cpus-than-that-of-its-initialized-on/7696
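For context, the manual workaround that this change automates looks roughly like the sketch below. It assumes the Ray AIR ScalingConfig (which accepts a trainer_resources dict) and uses the presence of the google.colab package as a Colab-detection heuristic; the heuristic and the _running_on_colab helper are illustrative, not the exact check used in this PR.

```python
# Illustrative sketch (not the exact change in this PR): reserve no CPU for the
# Trainer coordinator on Colab so both of Colab's 2 CPUs go to the workers.
import importlib.util

from ray.air.config import ScalingConfig


def _running_on_colab() -> bool:
    # Heuristic (assumption): the google.colab package only exists on Colab runtimes.
    return importlib.util.find_spec("google.colab") is not None


scaling_config = ScalingConfig(
    num_workers=2,  # data-parallel training with 2 workers, 1 CPU each
    # Zero out the coordinator's CPU reservation on Colab; keep the default elsewhere.
    trainer_resources={"CPU": 0} if _running_on_colab() else None,
)
```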

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Amog Kamsetty <[email protected]>
Signed-off-by: Amog Kamsetty <[email protected]>
@Yard1 (Member) commented Sep 27, 2022

Hmm, I wonder if it would make sense to extend this to any single-node cluster case. Not sure we have a foolproof way of determining that, though, if we consider autoscaling. This seems fine as a stopgap if there is no straightforward way of generalizing it.

@amogkam (Contributor, Author) commented Sep 27, 2022

Yeah, I don't think Ray provides good abstractions for figuring out cluster size and autoscaling behavior. Let's prioritize Colab for now since it has the most resource scarcity (only 2 CPUs) compared to other instances, which have more like 4-16.

We can cover all single-node cases in a follow-up.
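As a rough illustration of what such a follow-up might check (an assumption on my part, not code from this PR), one point-in-time way to detect a single-node cluster is to count alive nodes via the public ray.nodes() API. As noted above, this says nothing about whether the autoscaler could add nodes later.

```python
# Hypothetical sketch: detect a single-node Ray cluster at a point in time.
# This does not account for autoscaling, which is the caveat discussed above.
import ray

ray.init(ignore_reinit_error=True)

alive_nodes = [node for node in ray.nodes() if node.get("Alive")]
is_single_node = len(alive_nodes) == 1
print(f"Single-node cluster: {is_single_node}")
```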

@amogkam (Contributor, Author) commented Sep 27, 2022

Most people try out the AIR examples either on Colab or on their own laptop. Laptops usually have more CPUs (around 16), so this is less of a problem there. However, we can't have users churn because basic workflows don't work on Colab.

@Yard1 (Member) commented Sep 27, 2022

This all makes sense, let's follow up later

@amogkam amogkam merged commit fa3200f into ray-project:master Sep 28, 2022
@amogkam amogkam deleted the air-no-trainer-resources-colab branch September 28, 2022 21:32
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
[AIR] Don't add Trainer resources when running on Colab (#28822)

Signed-off-by: Amog Kamsetty <[email protected]>

Signed-off-by: Weichen Xu <[email protected]>