
cluster autoscaler not scaling up the autoscaling group when already downscaled to 0 #4893

Open
vkkumarswamy opened this issue May 17, 2022 · 23 comments
Labels
area/cluster-autoscaler area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@vkkumarswamy

This happens when the autoscaling group is downscaled to 0, i.e. the desired capacity is set to 0. If I then start the cluster autoscaler and create a pod which requires a node from this autoscaling group, somehow the scale-up does not happen.
I have defined node affinity towards this autoscaling group.

Below is the event log from pod describe.
Normal NotTriggerScaleUp 4m15s (x121 over 24m) cluster-autoscaler pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector

But it works when I manually set the desired capacity to 1 (while the cluster autoscaler is already running), set the desired capacity back to 0, and then make a new pod deployment.
It looks like the cluster autoscaler is not getting the node details associated with the autoscaling group at startup when the desired capacity is set to 0.
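
For context, a minimal sketch of the kind of affinity/tag pairing involved here; the label key node-group-role and value batch are hypothetical placeholders, not the actual values from my cluster:

# Pod side: node affinity pinning the pod to the target group's label
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-group-role        # hypothetical label key
                operator: In
                values: ["batch"]
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]

# ASG side: with 0 running nodes, the autoscaler can only learn this label
# from a node-template tag on the autoscaling group, e.g.
# k8s.io/cluster-autoscaler/node-template/label/node-group-role: batch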

@WebSpider
Contributor

I'm seeing this too and am a bit puzzled as to the solution

@vkkumarswamy
Author

Any updates on this? Or any workaround you would suggest?

@ZTGallagher

I'm experiencing the same thing. Did you ever find an answer to this?

It can scale up from 0 as expected only after I've scaled it up at least once manually while the cluster autoscaler is running.

So I assume it caches node info somehow, somewhere, and relates it to the ASG: "Ooooh, this node has a GPU! OK." However, I HAVE the "node-template" label and resource tags I'm "supposed to" have for it to scale from 0 on its own from the ASG. And yet I still have to scale up once manually before it can scale up from 0 itself.
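
For reference, the node-template tags I mean are ASG tags roughly like the following (a sketch based on my reading of the cluster-autoscaler AWS docs; the accelerator label key/value is a placeholder):

# ASG tags (key: value) that let the autoscaler build a node template at 0 capacity
k8s.io/cluster-autoscaler/node-template/label/accelerator: nvidia-tesla-t4
k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: "1"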

@rshad

rshad commented Jul 27, 2022

I am facing the same issue, and it looks like random behavior: sometimes it works, and sometimes it doesn't until I scale the node group once manually.

@DesmondH0

We are also experiencing the same issue, and I did dig in a little bit (see the sketch after this list):

  1. We have a NodeGroup that starts with 0 nodes and a taint, and we deploy the cluster-autoscaler
  2. We deploy a pod targeting that NodeGroup with a toleration
  3. We find this log from cluster-autoscaler:
I0803 11:24:40.153592       1 scale_up.go:300] Pod ${pod} can't be scheduled on ${desired_ASG}, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=

  3.1. So it did try the calculation with the NodeGroup that should trigger the scale-up
  4. From the code, it seems to run CheckPredicates, and I assume the clusterSnapshot doesn't have enough info for the ASG if it comes from 0 🤔
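
A sketch of what I mean by the taint plus toleration combination; the taint key dedicated is a placeholder, not our real config:

# ASG tag declaring the taint, so it is known even at 0 capacity
# (format per the cluster-autoscaler AWS docs, as I understand them)
k8s.io/cluster-autoscaler/node-template/taint/dedicated: "true:NoSchedule"

# Matching toleration in the pod spec
tolerations:
  - key: dedicated
    operator: Equal
    value: "true"
    effect: NoSchedule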

@rshad

rshad commented Aug 4, 2022

The official documentation already covers this. Since we are scaling up from capacity 0, this is not possible by default in the Cluster Autoscaler; to make it work, the official documentation indicates that we need to manually add a tag to the corresponding autoscaling group with the node-group label used by the GitLab Runner jobs' node selectors. Currently it is not possible via CDK to get the node group's autoscaling group, so this can only be added manually.

The tag for the label gitlab-runner-type/heavy is as follows:

key: k8s.io/cluster-autoscaler/node-template/label/gitlab-runner-type
value: heavy

I tested it and it works.
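
For completeness, this is how that tag pairs with the runner pods' node selector (a minimal sketch using the values from above):

# ASG tag (added manually on the autoscaling group)
k8s.io/cluster-autoscaler/node-template/label/gitlab-runner-type: heavy

# Node selector on the GitLab Runner job pods
nodeSelector:
  gitlab-runner-type: heavy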

@ZTGallagher

ZTGallagher commented Aug 4, 2022

@rshad

I appreciate the response and recognize you're right. In my case, however, we are tagging the ASGs and they're still not coming up properly.

0/1 nodes are available: 1 Insufficient nvidia.com/gpu

The ASGs are, however, tagged with k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu=1

I'm not sure, then, what the problem can be.

@rshad

rshad commented Aug 4, 2022

@ZTGallagher

What is the label in your case? I see that you want to use a non-label tag as a label for the node selector. As they indicate, the tag should be a label, not a resource. So your tag should be:

k8s.io/cluster-autoscaler/node-template/label/nvidia-gpu=1

And the label should be:

nvidia-gpu: 1
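
A sketch of that pairing; note that nodeSelector values in a pod spec are strings, so the 1 has to be quoted:

# ASG tag
k8s.io/cluster-autoscaler/node-template/label/nvidia-gpu: "1"

# Pod spec fragment
nodeSelector:
  nvidia-gpu: "1"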

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 2, 2022
@WebSpider
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 3, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 3, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 2, 2023
@WebSpider
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 8, 2023
@dogzzdogzz

Experiencing the same issue. From the log below, I think cluster-autoscaler remembers in memory all the taints on the last node of the ASG before it scales down to zero, even when some taints were added automatically by another service and do not exist in the ASG's tags. So if a new pod does not have a toleration for the additional taint added by that other service, cluster-autoscaler thinks the pod cannot tolerate the node group.

I0530 10:59:16.283578       1 scale_up.go:300] Pod overprovision-gp-c-type-arm64-spot-68d5cd4ccc-2rxz2 can't be scheduled on eks-general-purpose-worker-arm64-spot-c-type-xlarge, predicate checking error: node(s) had untolerated taint {aws-node-termination-handler/rebalance-recommendation: rebalance-recommendation-event-39396261316437352d336166632d3337}; predicateName=TaintToleration

The workaround is either to manually scale the ASG from 0 to 1 to refresh the ASG's taints in memory, or to restart the cluster-autoscaler pods to refresh everything.
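
As far as I understand the docs, the taints a scaled-to-zero group should advertise can also be declared explicitly with a node-template tag on the ASG, so the autoscaler does not have to rely on whatever it cached from the last real node, e.g. (placeholder taint key):

k8s.io/cluster-autoscaler/node-template/taint/dedicated: "true:NoSchedule"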

@geosigno

Any update on this? I am facing the same issue.

@der-eismann

Same issue here: we taint nodes when draining and before shutting them down. New nodes can't be started then, because cluster-autoscaler thinks all of them have these taints.

@matcasx

matcasx commented Sep 4, 2023

Same issue. The autoscaler doesn't work if we set the node count to 0, or after the node group scales down to 0.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2024
@der-eismann

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2024
@towca towca added the area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. label Mar 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2024
@WebSpider
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 17, 2024