[Bug] Autoscaler doesn't scale CPU-only workloads to workers with GPU #20476
andras-kth added the bug (Something that is supposed to be working, but isn't) and triage (Needs triage, e.g. priority, bug/not-bug, and owning component) labels on Nov 17, 2021
DmitriGekhtman added the P2 (Important issue, but not time-critical) label and removed the triage label on Jan 3, 2022
pang-wu added a commit to pang-wu/raydp that referenced this issue on Jul 29, 2022:
GPU auto scaling is a bug on Ray side. For more details, please see [this issue](ray-project/ray#20476).
carsonwang pushed a commit to oap-project/raydp that referenced this issue on Aug 1, 2022:
* Support fractional resource scheduling
* Fix Java and Scala code styling
* Fix tests
* Use marker to skip tests
* Refactor
* Use mock clusters, based on the doc here: https://docs.ray.io/en/latest/ray-core/examples/testing-tips.html#tip-4-create-a-mini-cluster-with-ray-cluster-utils-cluster
* Try to fix the test by running the custom resource test separately
* Remove GPU resource config. GPU auto scaling is a bug on Ray side. For more details, please see [this issue](ray-project/ray#20476).
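For context on the mock-cluster testing approach mentioned in the commit above, here is a minimal sketch following the linked testing-tips doc; the node sizes and the task are illustrative, not taken from the raydp tests.

```python
import ray
from ray.cluster_utils import Cluster

# Start a mock multi-node cluster in a single process, per the linked doc.
# Resource amounts here are illustrative.
cluster = Cluster(
    initialize_head=True,
    head_node_args={"num_cpus": 2},
)
cluster.add_node(num_cpus=2)              # CPU-only worker node
cluster.add_node(num_cpus=2, num_gpus=1)  # worker node that also has a GPU

ray.init(address=cluster.address)

@ray.remote(num_cpus=1)
def cpu_only_task():
    return "scheduled on any node with a free CPU"

print(ray.get(cpu_only_task.remote()))

ray.shutdown()
cluster.shutdown()
```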
DmitriGekhtman added the P1 (Issue that should be fixed within a few weeks) label and removed the P2 label on Aug 31, 2022
hora-anyscale added the docs (An issue or change related to documentation) and P2 labels and removed the P1 label on Dec 19, 2022
Per triage sync: need to update the docs to reflect that this is intended behavior.
The correct behavior is to avoid adding GPU workers when possible, but to add GPU workers when needed to fulfill the workload. The code for this is straightforward to implement (at most 5 lines, plus a test).
DmitriGekhtman added a commit that referenced this issue on Dec 21, 2022:
…riority (#31202) Closes #20476: Instead of preventing GPU upscaling due to non-GPU tasks, prefer non-GPU nodes by assigning low utilization score to the GPU nodes. Signed-off-by: Dmitri Gekhtman <[email protected]>
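To illustrate the idea in that fix, here is a conceptual sketch only, not the actual autoscaler code; the function name and the scoring tuple below are hypothetical. GPU node types remain eligible for CPU-only demand but sort below non-GPU node types instead of being excluded.

```python
# Conceptual sketch -- not Ray's actual autoscaler code.
# The function name and scoring tuple are hypothetical.

def node_type_score(node_resources, resource_demand, utilization):
    """Rank a candidate node type for a batch of pending resource requests.

    Higher tuples sort first. A GPU node type is deprioritized (not excluded)
    when none of the pending requests ask for GPUs.
    """
    has_gpu = node_resources.get("GPU", 0) > 0
    demand_wants_gpu = any(req.get("GPU", 0) > 0 for req in resource_demand)
    gpu_ok = 0 if (has_gpu and not demand_wants_gpu) else 1
    return (gpu_ok, utilization)

# CPU-only demand: the CPU-only node type wins, but the GPU node type could
# still be picked if it were the only type able to fit the requests.
demand = [{"CPU": 1}] * 8
candidates = {"cpu_worker": {"CPU": 8}, "gpu_worker": {"CPU": 8, "GPU": 1}}
best = max(candidates, key=lambda n: node_type_score(candidates[n], demand, 0.5))
print(best)  # -> cpu_worker
```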
AmeerHajAli pushed a commit that referenced this issue on Jan 12, 2023:
…riority (#31202) Closes #20476: Instead of preventing GPU upscaling due to non-GPU tasks, prefer non-GPU nodes by assigning low utilization score to the GPU nodes. Signed-off-by: Dmitri Gekhtman <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this issue on Jan 25, 2023:
…riority (ray-project#31202) Closes ray-project#20476: Instead of preventing GPU upscaling due to non-GPU tasks, prefer non-GPU nodes by assigning low utilization score to the GPU nodes. Signed-off-by: Dmitri Gekhtman <[email protected]> Signed-off-by: tmynn <[email protected]>
Search before asking
Ray Component
Ray Clusters
What happened + What you expected to happen
Clusters where all nodes have a GPU fail to autoscale on CPU-only workloads.
Changing the resource definition of the node type (leaving everything else intact) allows the cluster to autoscale.
I'm guessing that this may, in fact, be the intended behavior, as a cost-saving "feature"; in that case, the right "fix" would be to define node types both with and without GPUs.
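For illustration only (the report lists no reproduction script, see below), the workload in question has roughly this shape: purely CPU-bound tasks submitted to an autoscaling cluster whose worker types all declare a GPU.

```python
import ray

# Hypothetical sketch of the workload shape; the original report includes
# no reproduction script. Assumes an already-running autoscaling cluster.
ray.init(address="auto")

@ray.remote(num_cpus=1)  # requests CPUs only, no GPUs
def cpu_task(i):
    return i * i

# In Ray 1.8.0, pending CPU-only tasks like these reportedly did not trigger
# upscaling when every worker type in the cluster config declared a GPU;
# after #31202, GPU nodes may serve such demand, but non-GPU node types
# are preferred.
results = ray.get([cpu_task.remote(i) for i in range(1000)])
```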
Versions / Dependencies
Ray 1.8.0
Python 3.9
Reproduction script
N/A
Anything else
No response
Are you willing to submit a PR?