maxSurge for node draining or how to meet availability requirements when draining nodes by adding pods #114877
@txomon: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance. The `triage/accepted` label can be added by org members by writing `/triage accepted` in a comment. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/wg reliability |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten |
It seems @0xmichalis also advocated this on Mar 8, 2017, when drain was made to honor PodDisruptionBudgets, but it seems no one took notice of it. A lot of users are frustrated/confused by this: https://duckduckgo.com/?q=poddisruptionbudget+single+replica. I think it would help if the
/remove-lifecycle rotten |
This is still a huge obstacle for gracefully replacing nodes. |
I don't think so, it seems like it isn't gathering much attention, I'm not sure if the working group is even aware of it... |
With a single replica and the PDB, it's not possible to cleanly evict the pod from a node. Relevant kubernetes issue: kubernetes/kubernetes#114877 (comment)
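To make that scenario concrete, here is a minimal sketch of the problematic setup (all names below are hypothetical, not taken from the issue): a PDB with `minAvailable: 1` selecting a single-replica Deployment, so evicting the only pod would drop availability below the budget and the eviction is refused.

```yaml
# Minimal sketch (hypothetical names): a single-replica workload guarded by a PDB.
# With only one matching pod, the budget leaves zero disruptions allowed, so a
# voluntary eviction (e.g. during kubectl drain) is rejected until the pod is
# replaced some other way.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app        # hypothetical
  namespace: default
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: example-app     # matches a Deployment running replicas: 1
```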
/remove-wg reliability
SIG Apps is the primary owner of this FR due to the maxSurge feature. You can join their Zoom meeting and present it; other SIGs/WGs should only be added once maintainers have had a look at this ticket. |
It's important to understand that any controller (Deployment, or any other workload) and disruption are two distinct mechanisms having different roles, although their functionality is complementary when we're talking about ensuring high availability. I'd like to stress here the high availability factor, which is front and center when talking about PDBs. I will admit it's challenging to have any further discussion without that prerequisite fulfilled. In a similar vein, when talking about HPAs, there's an option for a minimum number of replicas, which ensures the application always maintains the HA pre-reqs.

Moving on to the subject of configuring a rolling update (in the case of a Deployment), versus responding to external disruptions (in this case, the PDB being an external actor is equal to invoking `kubectl delete` by the user): the reason we have the ability to have a detailed rollout comes from the fact that it is a process fully operated by the owning controller, whereas in all other situations (i.e. any external errors or disruptions) the controller's goal is to reach the desired state as quickly as possible.

Speaking from my own experience, we've had multiple instances of problems when PDBs were blocking upgrades, due to users setting the allowed replicas down to the bare minimum. We solved it by adding an alert (https://github.com/openshift/cluster-kube-controller-manager-operator/blob/d95b0c25ba55c4ef8e09e56461562ee60b22d51c/manifests/0000_90_kube-controller-manager-operator_05_alerts.yaml#L25-L44) which looks at PDBs and notifies administrators in those cases. Since then we haven't seen any problems, or cluster administrators were aware that problems might pop up during an upgrade and were able to solve them even before initiating the upgrades.

Having said all of the above, the SIG Apps (https://github.com/kubernetes/community/tree/master/sig-apps) meeting happens every other Monday; the next occurrence is planned for September 18. If you're interested, I'd be happy to hear more about your use cases and the problems you're struggling with. |
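The alert mentioned above is the OpenShift rule linked in that comment; as a rough, non-authoritative sketch of the same idea, a rule like the following can flag PDBs with no disruption headroom using kube-state-metrics' PodDisruptionBudget metrics. The expression, threshold, and names are illustrative assumptions, not the linked rule.

```yaml
# Illustrative PrometheusRule sketch (not the OpenShift rule linked above):
# warn when a PDB's healthy pod count is at or below its required minimum,
# i.e. any voluntary eviction would be blocked. Assumes kube-state-metrics
# and the prometheus-operator CRDs are installed.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pdb-at-limit        # hypothetical
spec:
  groups:
  - name: pdb.rules
    rules:
    - alert: PodDisruptionBudgetAtLimit
      expr: |
        kube_poddisruptionbudget_status_current_healthy
          <= kube_poddisruptionbudget_status_desired_healthy
        and kube_poddisruptionbudget_status_expected_pods > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "PDB {{ $labels.namespace }}/{{ $labels.poddisruptionbudget }} cannot tolerate any voluntary disruption"
```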
Hello Maciej,

I will do my best to join the Monday call.

Just for other people reading the thread, my use case is all the miscellaneous apps that are needed to run some services on top of Kubernetes (such as the Knative admission controller, external-dns, etc.), as well as any other app that just doesn't need scaling. These apps have PDBs that make sure there is always at least a single instance running (hence it makes sense that the minimum availability of the PDB is 1); however, running more than one instance at a time would be a waste of resources.

I understand there is ownership and interactions between the different controllers, and some of my ideas were revolving around creating a new attribute; however, it was brought to my attention that the `maxSurge` field perfectly describes the situation: "I'm okay with having up to X instances for a while".

Regarding the PDB-blocking-upgrades issue, I had those too; however, because we are running on GKE and nodes get recycled on a regular basis, we can't have a cluster operator waiting for a node maintenance to happen (nor do we want to).

I hope I was able to give enough context,
Cheers, Javier |
Likewise: usually when I run into this, it's when a cluster with low availability requirements is running cluster-critical services (e.g. an admission controller) where it is a complete waste of resources to run multiple replicas, but the ability to drain without manual interaction (e.g. in response to a spot instance removal) is important. I'm unable to join the SIG Apps call, as the meeting time is incompatible with Australian timezones. |
For anyone following the thread, https://github.com/kubernetes/enhancements/pull/4213/files was brought up during the SIG Apps weekly to be taken into account for this situation. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale |
/remove-lifecycle stale
Where can I follow any ongoing discussion? |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten |
/remove-lifecycle rotten |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale |
/remove-lifecycle stale |
What would you like to be added?
A way to drain nodes by adding more pods elsewhere to meet PodDisruptionBudgets.
Why is this needed?
Currently, a Deployment can be configured with a `maxSurge` to avoid going under the number of replicas the Deployment requires while allowing a new release to be rolled out. This parameter allows adding extra pods before subtracting the old ones, so that the required `replicas` count is always met as a minimum. This feature (to my knowledge) is only available when releasing new versions of an application; however, it would be extremely useful when draining nodes.
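For reference, this is roughly where that knob lives today (a sketch with hypothetical names and values, not a manifest from the issue): `maxSurge` sits under the Deployment's rolling-update strategy and only takes effect while the Deployment controller itself performs a rollout.

```yaml
# Sketch of the existing behaviour (hypothetical names/values): during a rollout
# the Deployment controller may create up to maxSurge extra pods before removing
# old ones, so the desired replica count is never undershot. This logic is not
# consulted when a node is drained.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app          # hypothetical
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # allow 1 extra pod during the rollout
      maxUnavailable: 0      # never go below the desired replicas
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: app
        image: registry.example.com/app:v1   # hypothetical image
```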
Usual cluster maintenance is done by adding new nodes before removing old ones. This means all the pods on the old node need to be evicted, and there is usually space on the new node for one more of each pod from the old node. Current solutions such as the PodDisruptionBudget or the Eviction API try to make sure that subtracting pods from the current amount doesn't break anything; however, the possibility of temporarily having one extra pod of each Deployment is not contemplated at the moment.
This request is asking for the ability to use a surplus of pods to meet all constraints for safe eviction (see the sketch below).
Some side notes to stress the importance: although evictions on large workloads work fine without PDBs, or with PDBs whose minAvailable/maxUnavailable settings leave headroom, the problem is aggravated when moving Deployments with one replica, or HPA-controlled Deployments that are currently scaled down far enough. It can then only be solved through a few inefficient means, which is exacerbated if node maintenance is done automatically (as on GKE and other cloud services).
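For context on why the existing mechanism can only subtract pods: `kubectl drain` asks for each pod to be removed through the pod's eviction subresource, and the API server rejects the request if it would violate a PDB. A minimal sketch of such a request (hypothetical names) is shown below; there is no variant that says "create a replacement first, then evict".

```yaml
# Sketch of a voluntary eviction request (hypothetical names). kubectl drain
# submits one of these per pod via POST to the pod's "eviction" subresource;
# if the deletion would violate a PodDisruptionBudget, the request is refused.
apiVersion: policy/v1
kind: Eviction
metadata:
  name: example-app-5c9d7d6f4-abcde   # the pod to evict (hypothetical)
  namespace: default
```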
Just in case, this is a limitation that should only be counted against Deployments with `strategy.type=RollingUpdate`.

Ways to deal with this situation currently:
- Set `minReplicas`/`replicas` to `>1`, and a PDB with `maxUnavailable=1`, when it's known that the autoscaler, if in use, is usually scaled on the lower end. Pros: there is no downtime. Cons: waste of resources. (See the sketch below.)
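As a concrete sketch of that workaround (hypothetical names and values): run a second replica permanently and pair it with a PDB that tolerates one disruption, trading wasted resources for drains that never block.

```yaml
# Sketch of the "over-provision" workaround (hypothetical names/values):
# with replicas: 2 and maxUnavailable: 1, one pod can always be evicted
# during a node drain, at the cost of an extra replica running at all times.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app          # hypothetical
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: app
        image: registry.example.com/app:v1   # hypothetical image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: example-app
```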