Cluster Autoscaler scaling from 0 on AWS EKS stops working after CA pod is restarted #3780
Comments
Did you follow all the steps mentioned here - How can I scale a node group to 0? |
@sowhatim thank you for responding. I was using multiple ASGs (auto-scaling groups) that span multiple AZs (availability zones) - I changed it so that each ASG only spans a single AZ ... I will report back within the next month and close this issue if I stop encountering it. |
I'm experiencing the same issue with one of my spot nodegroups with scale-to-0 enabled; it was working until yesterday. Manually scaling up, as @marwan116 suggested, works around it. |
A quick update - I am still experiencing the same issue even after making sure all ASGs only span the same AZ ... |
For me the same issue occurs after the CA pod has restarted (EKS 1.17, CA 1.17.4). Manually setting "desired = 1" on the ASG fixes this; after that, autoscaling to and from zero works for months. |
Same issue on EKS 1.19, CA 1.19.1 |
Same issue on EKS 1.18 and CA 1.18.1 via helm, I also had a restart on cluster-autoscaler-controller pod |
Thank you for the workaround |
Same issue on EKS 1.16 and CA 1.16.7. It stops working after a while. As @mvz27 pointed out, after manually setting the desired capacity to 1 the autoscaler starts working to and from zero again, since the CA then sets the desired capacity back to zero. But, again, after some days it stops working. Since this affects a lot of versions, it looks like it is something with the ASG ... |
Running into the same issue with EKS 1.19 (the latest version offered by AWS as of May 17, 2021), running cluster-autoscaler Helm chart 9.4.0 with self-managed worker nodes (GPU and non-GPU). As mentioned in previous posts, the issue happens when the CA pod is restarted, and setting the desired count to 1 fixes it until the next restart. |
The cluster-autoscaler documentation for AWS at https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md suggests adding labels and taints as ASG tags to give cluster-autoscaler additional hints. I am able to scale my AWS EKS deployment with a managed node group up from 0 only after adding these tags. |
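For reference, the hint tags from that README are plain tags on the Auto Scaling Group, in the form k8s.io/cluster-autoscaler/node-template/label/<label-name> and k8s.io/cluster-autoscaler/node-template/taint/<taint-key>. Below is a hypothetical sketch for a node group carrying a workload=gpu label and a dedicated=gpu:NoSchedule taint - the label name, taint key, and values are placeholders, so check the linked README for the exact format:

```yaml
# Hypothetical tags to place on the node group's Auto Scaling Group so that
# cluster-autoscaler can build a node template while the group sits at 0 nodes.
# "workload" and "dedicated" are placeholder names - use your own label/taint.
k8s.io/cluster-autoscaler/node-template/label/workload: "gpu"
k8s.io/cluster-autoscaler/node-template/taint/dedicated: "gpu:NoSchedule"
```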
@anandnilkal - can you please confirm whether tainting and tagging the nodegroup is still working with scaling from 0? (The reason I ask is that this issue doesn't present itself immediately, so it is hard to debug ...) When I was trying this out, I didn't have any taints on the node group, so I didn't think I needed to add any tags - maybe it is this edge case of untainted/untagged nodegroups that fails when scaling from 0? |
@anandnilkal - I just tried a tainted/tagged nodegroup. Scaling from 0 worked, then I forced a restart of the cluster-autoscaler pod (by deleting the pod) and scaling from 0 stopped working - so the same issue remains. |
@marwan116 I will test this scenario and update here. |
Same here. |
Got the same problem...so the only real solution (except for manually setting the desired capacity after each CA pod restart) is to leave desired at least at 1...? |
I seem to hit a similar issue when using the EBS CSI driver, while stateless deployments can scale fine from 0 even after restarting CA. Here is a working scale-up (before restarting CA):
And after CA restart:
|
Absolutely the same symptoms: after a restart of the autoscaler pod it won't see ASGs at 0 desired capacity. We have to manually push the minimum to 1 and then change it back to 0. Does anyone have an idea for a fix? |
So I think I figured out the issue in my case and it was due to using what I think is referred to as a predicate as a node affinity label to match on
and then I changed to using a custom node label and now the auto-scaler is robust to scaling from 0 after restarts
the updated cluster config file now looks like
Hope this helps |
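A minimal hypothetical sketch of that pattern - a custom node label on the node group, mirrored as a node-template tag so cluster-autoscaler can match it while the group is at 0, and a pod affinity keyed on that custom label. All names are placeholders, and this assumes eksctl propagates the nodeGroup tags to the underlying ASG:

```yaml
# Hypothetical eksctl nodeGroup fragment: a custom label ("workload-type")
# instead of a built-in/well-known one, plus the matching node-template tag.
nodeGroups:
  - name: batch-workers
    minSize: 0
    maxSize: 10
    labels:
      workload-type: batch
    tags:
      k8s.io/cluster-autoscaler/node-template/label/workload-type: "batch"
---
# Hypothetical pod-spec fragment: node affinity keyed on the custom label.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: workload-type
              operator: In
              values: ["batch"]
```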
in our case, we don't use
and the pods will have respective |
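For comparison, a hypothetical fragment of the nodeSelector/tolerations variant of the same idea - the taint key and value here must match both the node group's taint and its node-template taint tag (all names are placeholders):

```yaml
# Hypothetical pod-spec fragment: pin pods to the dedicated node group via a
# nodeSelector and tolerate its taint. Keys and values are placeholders.
spec:
  nodeSelector:
    workload-type: batch
  tolerations:
    - key: dedicated
      operator: Equal
      value: batch
      effect: NoSchedule
```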
We are also experiencing this issue and are using affinities, taints, and tolerations (tainted managed node groups set up with Terraform, and Jenkins build agents with affinities and tolerations). We are set up exactly as the OP, except that we are using Terraform instead of eksctl, with Helm chart version ... We attempted to add a few additional replicas of the CA. Checking the logs of each shows that there is a leader election process. Our hope was that when the leader CA pod went down, it would relinquish its state lock and a new leader would be elected ... at least when performing a ... It's also worth mentioning that when the new CA pod came back, it resumed as leader and, according to its logs, still knew about the node group in question, but simply wasn't triggering scale-ups anymore. Hard to say what's going on here, but we figured we would leave our observations. If this isn't helpful at all, please disregard :) |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
We're still facing this issue on AWS EKS 1.22 after a cluster-autoscaler pod restart, which seems to send the component into an unstable state.
The workaround of manually scaling up nodes to 1 resets the normal behaviour of the autoscaler but is difficult to maintain in production. |
If I get this right, the problem comes from the way AWS EKS tags ASGs under Managed Node Groups. When a Managed Node Group is created with labels and taints, the MNG is tagged with helper tags, which are used by Cluster Autoscaler (as described in the Auto Discovery Setup documentation). So, for example, the node group gets tagged with "k8s.io/cluster-autoscaler/node-template/label/<label-name>: <label-value>". However, the Auto Scaling Group which is created by the MNG does not get these helper tags, so Cluster Autoscaler cannot get information about the labels that will be assigned to a new EC2 instance once it is created. I have tagged the ASG manually and restarted CA - after the restart, CA picked up the tags from the ASG, scaled up my node group from 0 to 1, and deployed a pod onto it. It seems to me, then, that CA is reading this configuration from the ASG, and this is the cause of the problem. |
Just had this bug occur to me today. @pbaranow is right about the tag: by default, the tags applied to the managed node group are not automatically applied to the Auto Scaling Group. Manually setting the node-template tags on the Auto Scaling Group works around this. |
@aandrushchenko this:
is wrong - it should be
No need for
Scaling up from 0 is working fine for us, on many node groups, even after restarting CA. I suspect that there is a configuration problem like the above in each case where it doesn't work, but maybe not always the same configuration problem. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
I can confirm that after setting the right tags on the ASGs, everything works perfectly even if the autoscaler restarts. |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned |
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Which component are you using?: cluster-autoscaler
What version of the component are you using?:
Component version:
1.18.3
What k8s version are you using (kubectl version)?: 1.18
What environment is this in?: AWS EKS
What did you expect to happen?: I have a cluster-autoscaler deployed on AWS EKS. It is scaling three non-managed nodegroups from 0 and is running smoothly. However, after a period of time (which seems to be somewhere between 14 and 20 days) the autoscaler seems to "lose" visibility of some of the nodegroups and starts failing to find a place to schedule the pods.
Please note that this behavior has been consistent for at least the past three months (over different versions of the autoscaler on different versions of Kubernetes - I have also tried this on Kubernetes 1.17 with cluster-autoscaler versions 1.17.4 and 1.17.3).
Also please note that no modifications have been made to the pod spec or the nodes that I have been using (the pods get deployed as part of a scheduled job).
To resolve this issue - I have to manually scale the nodegroup to a number of nodes that is non-zero and different from the "desired capacity" - the nodegroups then become visible again to the autoscaler and it resumes functioning properly.
Please see the following logs from when the autoscaler fails to fit a pod:
Then, after my manual intervention to scale the nodegroup, it works fine - some sample logs:
I am not sure how to help reproduce this issue without waiting on the cluster-autoscaler to fail - I am wondering if someone else might have faced this