Cluster Autoscaler scaling from 0 on AWS EKS stops working after CA pod is restarted #3780
Comments
Did you follow all the steps mentioned here - How can I scale a node group to 0? |
@sowhatim thank you for responding. I was using multiple ASGs (auto-scaling groups) that span multiple AZs (availability zones) - I changed it so that each ASG only spans a single AZ ... I will report back within the next month and close this issue if I stop encountering it. |
I'm experiencing the same issue with one of my spot nodegroups with scale-to-0 enabled; it was working until yesterday. Manually scaling up, as @marwan116 suggested, works around it. |
A quick update - I am still experiencing the same issue even after making sure all ASGs only span the same AZ ... |
For me the same issue occurs after the CA pod has restarted (EKS 1.17, CA 1.17.4). Manually setting "desired = 1" on the ASG fixes this; after that, autoscaling to and from zero works for months. |
Same issue on EKS 1.19, CA 1.19.1 |
Same issue on EKS 1.18 and CA 1.18.1 via helm, I also had a restart on cluster-autoscaler-controller pod |
Thank you for the workaround |
Same issue on EKS 1.16 and CA 1.16.7. It stops working after a while. As @mvz27 pointed out, after manually setting the desired capacity to 1 the autoscaler starts working to and from zero again, since the CA then sets the desired capacity back to zero. But, again, after some days it stops working. Since this affects a lot of versions, it looks like it is something with the ASG ... |
Running into the same issue with EKS 1.19 (the latest version offered by AWS as of May 17, 2021), running cluster-autoscaler Helm chart 9.4.0 with self-managed worker nodes (GPU and non-GPU). As mentioned in previous posts, the issue happens when the CA pod is restarted, and setting the desired count to 1 fixes it until the next restart. |
The cluster-autoscaler documentation for AWS at https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md suggests adding labels and taints as ASG tags to give cluster-autoscaler additional hints. I am able to scale my AWS EKS deployment with a managed node group up from 0 only after adding these tags. |
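For reference, the hint tags from that README are plain tags on the Auto Scaling Group, in the form k8s.io/cluster-autoscaler/node-template/label/<label-name> and k8s.io/cluster-autoscaler/node-template/taint/<taint-key>. Below is a hypothetical sketch for a node group carrying a workload=gpu label and a dedicated=gpu:NoSchedule taint - the label name, taint key, and values are placeholders, so check the linked README for the exact format:

```yaml
# Hypothetical tags to place on the node group's Auto Scaling Group so that
# cluster-autoscaler can build a node template while the group sits at 0 nodes.
# "workload" and "dedicated" are placeholder names - use your own label/taint.
k8s.io/cluster-autoscaler/node-template/label/workload: "gpu"
k8s.io/cluster-autoscaler/node-template/taint/dedicated: "gpu:NoSchedule"
```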
@anandnilkal - can you please confirm whether tainting and tagging the nodegroup is still working with scaling from 0? (The reason I ask is that this issue doesn't present itself immediately, so it is hard to debug ...) When I was trying this out, I didn't have any taints on the node group, so I didn't think I needed to add any tags - maybe it is this edge case of untainted/untagged nodegroups that fails when scaling from 0? |
@anandnilkal - I just tried a tainted/tagged nodegroup. Scaling from 0 worked, then I forced a restart of the cluster-autoscaler pod (by deleting the pod) and scaling from 0 stopped working - so the same issue remains. |
@marwan116 I will test this scenario and update here. |
Same here. |
Got the same problem...so the only real solution (except for manually setting the desired capacity after each CA pod restart) is to leave desired at least at 1...? |
I seem to hit a similar issue when using the EBS CSI driver, while stateless deployments can scale fine from 0 even after restarting CA. Here is a working scale-up (before restarting CA):
And after CA restart:
|
Absolutely the same symptoms: after a restart of the autoscaler pod it won't see ASGs at 0 desired capacity. We have to manually push the minimum to 1 and then change it back to 0. Does anyone have an idea for a fix? |
So I think I figured out the issue in my case and it was due to using what I think is referred to as a predicate as a node affinity label to match on
and then I changed to using a custom node label and now the auto-scaler is robust to scaling from 0 after restarts
the updated cluster config file now looks like
Hope this helps |
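A minimal hypothetical sketch of that pattern - a custom node label on the node group, mirrored as a node-template tag so cluster-autoscaler can match it while the group is at 0, and a pod affinity keyed on that custom label. All names are placeholders, and this assumes eksctl propagates the nodeGroup tags to the underlying ASG:

```yaml
# Hypothetical eksctl nodeGroup fragment: a custom label ("workload-type")
# instead of a built-in/well-known one, plus the matching node-template tag.
nodeGroups:
  - name: batch-workers
    minSize: 0
    maxSize: 10
    labels:
      workload-type: batch
    tags:
      k8s.io/cluster-autoscaler/node-template/label/workload-type: "batch"
---
# Hypothetical pod-spec fragment: node affinity keyed on the custom label.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: workload-type
              operator: In
              values: ["batch"]
```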
in our case, we don't use
and the pods will have respective |
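For comparison, a hypothetical fragment of the nodeSelector/tolerations variant of the same idea - the taint key and value here must match both the node group's taint and its node-template taint tag (all names are placeholders):

```yaml
# Hypothetical pod-spec fragment: pin pods to the dedicated node group via a
# nodeSelector and tolerate its taint. Keys and values are placeholders.
spec:
  nodeSelector:
    workload-type: batch
  tolerations:
    - key: dedicated
      operator: Equal
      value: batch
      effect: NoSchedule
```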
We are also experiencing this issue and are using affinities, taints, and tolerations (tainted managed node groups set up with Terraform, and Jenkins build agents with affinities and tolerations). We are set up exactly as the OP, except that we are using Terraform instead of eksctl, with Helm chart version ... We attempted to add a few additional replicas of the CA. Checking the logs of each shows that there is a leader election process. Our hope was that when the leader CA pod went down, it would relinquish its state lock and a new leader would be elected ... at least when performing a ... It's also worth mentioning that when the new CA pod came back, it resumed as leader and, according to its logs, still knew about the node group in question, but simply wasn't triggering scale-ups anymore. Hard to say what's going on here, but we figured we would leave our observations. If this isn't helpful at all, please disregard :) |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
We're still facing this issue on AWS EKS 1.22 after a cluster-autoscaler pod restart, which seems to send the component into an unstable state.
The workaround of manually scaling up nodes to 1 resets the normal behaviour of the autoscaler but is difficult to maintain in production. |
If I get this right, the problem comes from the way AWS EKS tags ASGs under Managed Node Groups. When a Managed Node Group is created with labels and taints, the MNG is tagged with helper tags, which are used by Cluster Autoscaler (as described in the Auto Discovery Setup documentation). So, for example, the node group gets tagged with "k8s.io/cluster-autoscaler/node-template/label/<label-name>: <label-value>". However, the Auto Scaling Group which is created by the MNG does not get these helper tags, so Cluster Autoscaler cannot get information about the labels that will be assigned to a new EC2 instance once it is created. I have tagged the ASG manually and restarted CA - after the restart, CA picked up the tags from the ASG, scaled up my node group from 0 to 1, and deployed a pod onto it. It seems to me, then, that CA is reading this configuration from the ASG, and this is the cause of the problem. |
Just had this bug occur to me today. @pbaranow is right about the tag: by default, the tags applied to the managed node group are not automatically applied to the Auto Scaling Group. Manually setting the node-template tags on the Auto Scaling Group works around this. |
@aandrushchenko this:
is wrong - it should be
No need for
Scaling up from 0 is working fine for us, on many node groups, even after restarting CA. I suspect that there is a configuration problem like the above in each case where it doesn't work, but maybe not always the same configuration problem. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
I can confirm that after setting the right tags on the ASGs, everything works perfectly even if the autoscaler restarts. |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned |
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Which component are you using?: cluster-autoscaler
What version of the component are you using?:
Component version:
1.18.3
What k8s version are you using (kubectl version)?: 1.18
What environment is this in?: AWS EKS
What did you expect to happen?: I have a cluster-autoscaler deployed on AWS EKS. It is scaling three non-managed nodegroups from 0 and is running smoothly. However, after a period of time (which seems to be somewhere between 14 and 20 days) the autoscaler seems to "lose" visibility of some of the nodegroups and starts failing to find a place to schedule the pods.
Please note that this behavior has been consistent for at least the past three months (over different versions of the autoscaler on different versions of Kubernetes - I have also tried this on Kubernetes 1.17 with cluster-autoscaler versions 1.17.4 and 1.17.3).
Also please note that no modifications have been made to the pod spec or the nodes that I have been using (the pods get deployed as part of a scheduled job).
To resolve this issue - I have to manually scale the nodegroup to a number of nodes that is non-zero and different from the "desired capacity" - the nodegroups then become visible again to the autoscaler and it resumes functioning properly.
Please see the following logs from when the autoscaler fails to fit a pod:
Then, after my manual intervention to scale the nodegroup, it works fine - some sample logs:
I am not sure how to help reproduce this issue without waiting on the cluster-autoscaler to fail - I am wondering if someone else might have faced this