[Feature] [Autoscaler] Scaling Intelligently Based on Observed Resource Bottlenecks (related: task & actor profiling) #21301
Comments
Hi, I'm a bot from the Ray team :) To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months. If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.
@ericl @jon-chuang is this on the roadmap?
+1, any update on this?
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message. Please feel free to reopen or open a new issue if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for opening the issue!
Search before asking
Description
For more efficient resource utilization, the autoscaler, given a configurable selection of node types it can add or remove, each with its own basket of resources, should be able to scale intelligently toward the observed resource bottleneck. It could use the real-time resource utilization of recently and currently running tasks, either independently of or in conjunction with the logical resource requirements of placement groups.
For instance, if the observed bottleneck is memory, we might not want to add compute-optimized nodes, and may want to add memory-optimized nodes instead.
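As a hedged illustration of the selection logic described above (this is not a Ray API; the function, node-type names, resource names, and scoring heuristic are all hypothetical):

```python
# Hypothetical sketch (not part of the Ray autoscaler): choose which node
# type to add based on the observed utilization bottleneck. All names and
# the scoring heuristic are illustrative.

def pick_node_type(observed_util, node_types):
    """observed_util: resource name -> utilization fraction in [0, 1].
    node_types: node-type name -> {resource name -> capacity}."""
    # Treat the most heavily utilized resource as the bottleneck.
    bottleneck = max(observed_util, key=observed_util.get)

    # Prefer the node type that dedicates the largest share of its total
    # capacity to the bottlenecked resource (a crude cost-efficiency proxy).
    def bottleneck_share(resources):
        total = sum(resources.values())
        return resources.get(bottleneck, 0) / total if total else 0.0

    return max(node_types, key=lambda name: bottleneck_share(node_types[name]))


observed = {"CPU": 0.35, "memory": 0.92}
node_types = {
    "compute-optimized": {"CPU": 64, "memory": 128},  # memory in GiB
    "memory-optimized": {"CPU": 16, "memory": 512},
}
print(pick_node_type(observed, node_types))  # memory-optimized
```

A real implementation would of course also weigh instance cost, pending placement-group demands, and the existing fleet composition, not just a single bottleneck signal.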
Use case
Cost savings for Ray users, and fewer encounters with resource starvation: another great selling point for Ray.
Further Work
Related to better resource packing: consider also making the autoscaler aware (both logically and via profiling) of fractional GPU usage such as NVIDIA MIG (#12413, NVIDIA MIG blog post).
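A hedged sketch of what fractional-GPU awareness could mean for packing decisions (pure Python, no Ray API; the helper is hypothetical): before adding a whole new GPU node, the autoscaler could count how many physical GPUs the outstanding fractional demands actually require.

```python
# Hypothetical sketch: first-fit-decreasing packing of fractional GPU
# requests (e.g. 0.25 could correspond to a MIG slice). Returns how many
# whole GPUs are needed to satisfy the demands.

def gpus_needed(fractional_demands):
    free = []  # remaining free fraction on each allocated GPU
    for demand in sorted(fractional_demands, reverse=True):
        for i, f in enumerate(free):
            if f >= demand:
                free[i] -= demand
                break
        else:
            # No existing GPU has room; allocate a new one.
            free.append(1.0 - demand)
    return len(free)


# Five fractional tasks fit on two GPUs, so a bottleneck-aware autoscaler
# need not request a third.
print(gpus_needed([0.5, 0.5, 0.25, 0.25, 0.25]))  # 2
```

An autoscaler aware of both the logical fractions and the profiled actual usage could avoid over-provisioning GPU nodes when slices still have headroom.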
Related issues
Memory aware scheduling & profiling-based placement: #20495
maybe related (autoscaler's poor understanding of resources): #20476
Related to task and actor profiling, which will be written up in an RFC soon.
Are you willing to submit a PR?