[Feature] [Autoscaler] Scaling Intelligently Based on Observed Resource Bottlenecks (related: task & actor profiling) #21301

Open
1 of 2 tasks
jon-chuang opened this issue Dec 30, 2021 · 5 comments
Labels
core (Issues that should be addressed in Ray Core), core-autoscaler (autoscaler related issues), enhancement (Request for new feature and/or capability), P1.5 (Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared)

Comments

@jon-chuang
Contributor

jon-chuang commented Dec 30, 2021

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

For more efficient resource utilization, I believe the autoscaler should scale intelligently to address resource bottlenecks. Given a config that offers a selection of node types it can add or remove, each with its own basket of resources, it should choose among them based on the real-time observed resource utilization of recently and currently running tasks. This could work independently of, or in conjunction with, the logical resource requirements of placement groups.

For instance, we might not want to add compute-optimized nodes if the observed bottleneck is memory, and may want to add memory-optimized nodes instead.
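
To make the idea concrete, here is a minimal sketch of such a selection policy in plain Python. It is not Ray's autoscaler API: `NodeType`, `find_bottleneck`, `choose_node_type`, the 0.8 threshold, the resource names, and the hourly costs are all hypothetical and chosen only for illustration. The heuristic (pick the node type offering the most of the bottlenecked resource per unit cost) is one possible policy among many.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class NodeType:
    """Hypothetical node type: a name, an hourly cost, and its basket of resources."""
    name: str
    hourly_cost: float
    resources: Dict[str, float]


def find_bottleneck(utilization: Dict[str, float],
                    threshold: float = 0.8) -> Optional[str]:
    """Return the most utilized resource if it meets the threshold, else None.

    `utilization` maps a resource name to the observed fraction in use (0.0-1.0).
    """
    if not utilization:
        return None
    resource, used = max(utilization.items(), key=lambda kv: kv[1])
    return resource if used >= threshold else None


def choose_node_type(node_types: List[NodeType],
                     utilization: Dict[str, float],
                     threshold: float = 0.8) -> Optional[NodeType]:
    """Pick the node type with the most of the bottlenecked resource per unit cost."""
    bottleneck = find_bottleneck(utilization, threshold)
    if bottleneck is None:
        return None  # nothing is saturated, so no scale-up is suggested
    candidates = [n for n in node_types if n.resources.get(bottleneck, 0) > 0]
    if not candidates:
        return None  # no configured node type relieves this bottleneck
    return max(candidates, key=lambda n: n.resources[bottleneck] / n.hourly_cost)


# Illustrative node types; sizes and prices are made up.
NODE_TYPES = [
    NodeType("compute_optimized", 0.68, {"CPU": 16, "memory_gb": 32}),
    NodeType("memory_optimized", 0.50, {"CPU": 8, "memory_gb": 64}),
]

if __name__ == "__main__":
    observed = {"CPU": 0.35, "memory_gb": 0.93}  # memory is the bottleneck
    choice = choose_node_type(NODE_TYPES, observed)
    print(choice.name if choice else "no scale-up needed")
```

With memory at 93% and CPU at 35%, this sketch would prefer the memory-optimized node type. A real implementation would also need to reconcile this signal with the logical resource demands the autoscaler already considers.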

Use case

Cost savings for Ray users and fewer encounters with resource bottlenecks, which would be another great selling point for Ray.

Further Work

Related to better resource packing: also consider making the autoscaler aware (both logically and via profiling) of fractional GPU usage such as MIG (#12413, NVIDIA MIG blogpost).

Related issues

Memory aware scheduling & profiling-based placement: #20495

Possibly related (autoscaler's poor understanding of resources): #20476

Related to task and actor profiling, which will be written up as an RFC soon.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@jon-chuang jon-chuang added the enhancement Request for new feature and/or capability label Dec 30, 2021
@jon-chuang jon-chuang changed the title from "[Feature] Autoscaler: Scaling Based on Observed Resource Bottlenecks (related: task & actor profiling)" to "[Feature] [Autoscaler] Scaling Intelligently Based on Observed Resource Bottlenecks (related: task & actor profiling)" Dec 30, 2021
@stale

stale bot commented Apr 30, 2022

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Apr 30, 2022
@sivaguru-pat

sivaguru-pat commented May 23, 2022

@ericl @jon-chuang is this on the roadmap?

@stale stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label May 23, 2022
@pang-wu

pang-wu commented Jul 29, 2022

+1, any update on this?

@scv119 scv119 self-assigned this Aug 25, 2022
@stale

stale bot commented Dec 23, 2022

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Dec 23, 2022
@stale

stale bot commented Jan 17, 2023

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

@stale stale bot closed this as completed Jan 17, 2023
@jjyao jjyao reopened this Nov 15, 2023
@stale stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Nov 15, 2023
@jjyao jjyao added the core-autoscaler autoscaler related issues label Nov 15, 2023
@jjyao jjyao added P1.5 Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared core Issues that should be addressed in Ray Core labels Nov 15, 2023