[Feature] [Autoscaler] Scaling Intelligently Based on Observed Resource Bottlenecks (related: task & actor profiling) #21301
Comments
Hi, I'm a bot from the Ray team :) To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months. If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.
@ericl @jon-chuang is this on the roadmap?
+1, any update on this?
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message. Please feel free to reopen or open a new issue if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for opening the issue!
Search before asking
Description
For more efficient resource utilization, the autoscaler, given a configurable selection of node types it can add or remove, each with its own basket of resources, should be able to scale intelligently toward the observed resource bottleneck. It could use the real-time resource utilization of recently and currently running tasks, either independently of or in conjunction with the logical resource requirements of placement groups.
For instance, if the observed bottleneck is memory, we might not want to add compute-optimized nodes, and may want to add memory-optimized nodes instead.
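As a hedged illustration of the selection logic described above (this is not a Ray API; the function, node-type names, resource names, and scoring heuristic are all hypothetical):

```python
# Hypothetical sketch (not part of the Ray autoscaler): choose which node
# type to add based on the observed utilization bottleneck. All names and
# the scoring heuristic are illustrative.

def pick_node_type(observed_util, node_types):
    """observed_util: resource name -> utilization fraction in [0, 1].
    node_types: node-type name -> {resource name -> capacity}."""
    # Treat the most heavily utilized resource as the bottleneck.
    bottleneck = max(observed_util, key=observed_util.get)

    # Prefer the node type that dedicates the largest share of its total
    # capacity to the bottlenecked resource (a crude cost-efficiency proxy).
    def bottleneck_share(resources):
        total = sum(resources.values())
        return resources.get(bottleneck, 0) / total if total else 0.0

    return max(node_types, key=lambda name: bottleneck_share(node_types[name]))


observed = {"CPU": 0.35, "memory": 0.92}
node_types = {
    "compute-optimized": {"CPU": 64, "memory": 128},  # memory in GiB
    "memory-optimized": {"CPU": 16, "memory": 512},
}
print(pick_node_type(observed, node_types))  # memory-optimized
```

A real implementation would of course also weigh instance cost, pending placement-group demands, and the existing fleet composition, not just a single bottleneck signal.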
Use case
Cost savings for Ray users, and fewer encounters with resource starvation: another great selling point for Ray.
Further Work
Related to better resource packing: consider also making the autoscaler aware (both logically and via profiling) of fractional GPU usage such as NVIDIA MIG (#12413, NVIDIA MIG blog post).
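A hedged sketch of what fractional-GPU awareness could mean for packing decisions (pure Python, no Ray API; the helper is hypothetical): before adding a whole new GPU node, the autoscaler could count how many physical GPUs the outstanding fractional demands actually require.

```python
# Hypothetical sketch: first-fit-decreasing packing of fractional GPU
# requests (e.g. 0.25 could correspond to a MIG slice). Returns how many
# whole GPUs are needed to satisfy the demands.

def gpus_needed(fractional_demands):
    free = []  # remaining free fraction on each allocated GPU
    for demand in sorted(fractional_demands, reverse=True):
        for i, f in enumerate(free):
            if f >= demand:
                free[i] -= demand
                break
        else:
            # No existing GPU has room; allocate a new one.
            free.append(1.0 - demand)
    return len(free)


# Five fractional tasks fit on two GPUs, so a bottleneck-aware autoscaler
# need not request a third.
print(gpus_needed([0.5, 0.5, 0.25, 0.25, 0.25]))  # 2
```

An autoscaler aware of both the logical fractions and the profiled actual usage could avoid over-provisioning GPU nodes when slices still have headroom.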
Related issues
Memory aware scheduling & profiling-based placement: #20495
maybe related (autoscaler's poor understanding of resources): #20476
Related to task and actor profiling, which will be written up in an RFC soon.
Are you willing to submit a PR?