
[Bug] Autoscaler doesn't scale CPU-only workloads to workers with GPU #20476

Closed
1 of 2 tasks
andras-kth opened this issue Nov 17, 2021 · 2 comments · Fixed by #31202
Assignees
Labels
bug: Something that is supposed to be working; but isn't
docs: An issue or change related to documentation
infra: autoscaler, ray client, kuberay, related issues
P2: Important issue, but not time-critical
Milestone

Comments

@andras-kth

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Clusters

What happened + What you expected to happen

Clusters where all nodes have a GPU fail to autoscale on CPU-only workloads.

The autoscaler could not find a node type to satisfy the request: [{'CPU': 1.0} ,...

Changing the resource definition of the node type (leaving everything else intact) allows the cluster to autoscale.

I'm guessing that this may, in fact, be the intended behavior, as a cost-saving "feature".
If so, the right "fix" would be to define node types both with and without GPU.
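For illustration, the workaround described above could look like the following cluster-config fragment (a sketch: node-type names and resource counts are hypothetical, and only `available_node_types` is shown):

```yaml
# Hypothetical fragment of a Ray cluster config: a CPU-only worker type
# alongside the GPU worker type, so CPU-only demand has a non-GPU target.
available_node_types:
  cpu_worker:
    resources: {"CPU": 4}   # no GPU key: the autoscaler can pick this for CPU-only tasks
    min_workers: 0
    max_workers: 10
  gpu_worker:
    resources: {"CPU": 4, "GPU": 1}
    min_workers: 0
    max_workers: 4
```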

Versions / Dependencies

Ray 1.8.0
Python 3.9

Reproduction script

N/A

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@andras-kth andras-kth added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 17, 2021
@DmitriGekhtman DmitriGekhtman added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 3, 2022
@wuisawesome wuisawesome added this to the Serverless Autoscaling milestone Jan 4, 2022
@AmeerHajAli AmeerHajAli added the infra autoscaler, ray client, kuberay, related issues label Mar 26, 2022
pang-wu added a commit to pang-wu/raydp that referenced this issue Jul 29, 2022
GPU auto scaling is a bug on Ray side. For more details, please see [this issue](ray-project/ray#20476).
carsonwang pushed a commit to oap-project/raydp that referenced this issue Aug 1, 2022
* Support fractional resource scheduling

* Fix java and scala code styling.

* Fix tests.

* Use marker to skip tests

* Refactor

* Use mock clusters.

Use mock cluster based on doc here: https://docs.ray.io/en/latest/ray-core/examples/testing-tips.html#tip-4-create-a-mini-cluster-with-ray-cluster-utils-cluster

* try to fix test by running the custom resource test separately.

* Remove GPU resource config.

GPU auto scaling is a bug on Ray side. For more details, please see [this issue](ray-project/ray#20476).
@DmitriGekhtman DmitriGekhtman added P1 Issue that should be fixed within a few weeks and removed P2 Important issue, but not time-critical labels Aug 31, 2022
@DmitriGekhtman DmitriGekhtman removed their assignment Nov 19, 2022
@hora-anyscale hora-anyscale added docs An issue or change related to documentation P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks labels Dec 19, 2022
@hora-anyscale
Contributor

Per Triage Sync: Need to update docs to reflect that this is intended behavior

@DmitriGekhtman
Contributor

DmitriGekhtman commented Dec 19, 2022

The correct behavior is to avoid adding GPU workers when possible, but to add GPU workers when needed to fulfill the workload.

The code for this is straightforward to implement (at most 5 lines, plus a test).
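A hedged sketch of that idea (not Ray's actual implementation; the function names and scoring formula here are assumptions): feasible GPU node types are demoted below non-GPU ones for CPU-only demand, so they are selected only when nothing else can satisfy the request.

```python
# Sketch: choose a node type for a resource demand, preferring non-GPU
# node types when the demand does not ask for GPUs. This mirrors the
# "give GPU nodes the lowest priority" idea, not Ray's real code.

def feasible(node_resources, demand):
    """A node type can satisfy the demand if every requested resource fits."""
    return all(node_resources.get(res, 0) >= amt for res, amt in demand.items())

def priority(node_resources, demand):
    """Sort key: higher is preferred. GPU nodes are demoted for non-GPU demand."""
    wants_gpu = demand.get("GPU", 0) > 0
    has_gpu = node_resources.get("GPU", 0) > 0
    gpu_penalty = -1 if (has_gpu and not wants_gpu) else 0
    # Tie-break with a crude utilization score: how much of the node the demand uses.
    utilization = sum(
        demand.get(res, 0) / amt for res, amt in node_resources.items()
    ) / len(node_resources)
    return (gpu_penalty, utilization)

def pick_node_type(node_types, demand):
    candidates = [n for n, r in node_types.items() if feasible(r, demand)]
    if not candidates:
        return None  # the "could not find a node type" case from this issue
    return max(candidates, key=lambda n: priority(node_types[n], demand))

node_types = {
    "cpu_worker": {"CPU": 4},
    "gpu_worker": {"CPU": 4, "GPU": 1},
}
print(pick_node_type(node_types, {"CPU": 1.0}))            # cpu_worker
print(pick_node_type(node_types, {"CPU": 1.0, "GPU": 1}))  # gpu_worker
```

With only GPU node types defined, a CPU-only demand still falls back to a GPU worker rather than failing, which is the behavior the fix in #31202 describes.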

DmitriGekhtman added a commit that referenced this issue Dec 21, 2022
…riority (#31202)

Closes #20476:
Instead of preventing GPU upscaling due to non-GPU tasks, prefer non-GPU nodes by assigning low utilization score to the GPU nodes.

Signed-off-by: Dmitri Gekhtman <[email protected]>
AmeerHajAli pushed a commit that referenced this issue Jan 12, 2023
…riority (#31202)

Closes #20476:
Instead of preventing GPU upscaling due to non-GPU tasks, prefer non-GPU nodes by assigning low utilization score to the GPU nodes.

Signed-off-by: Dmitri Gekhtman <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this issue Jan 25, 2023
…riority (ray-project#31202)

Closes ray-project#20476:
Instead of preventing GPU upscaling due to non-GPU tasks, prefer non-GPU nodes by assigning low utilization score to the GPU nodes.

Signed-off-by: Dmitri Gekhtman <[email protected]>
Signed-off-by: tmynn <[email protected]>