Use kubectl rollout restart on single-replica Deployments with PDB #928

Open

applejag opened this issue May 8, 2024 · 9 comments

applejag commented May 8, 2024

Good day! First off, really nice tool ❤️ It works exactly as advertised.

We've hit an issue where we want to have deployments with .spec.replicas: 1, but still make use of PodDisruptionBudget (PDB). Most Kubernetes discussions and the Kubernetes documentation basically summarize this as a "bad idea", usually with the argument that "if you only have 1 replica, then you have already agreed to not having it HA".

When kured evicts a pod (or when kubectl drain evicts one), the existing pod is terminated before its replacement is up and running, leading to unwanted downtime/disruptions. Adding a PDB is supposed to solve this, but when you only have 1 replica the PDB will basically always report .status.disruptionsAllowed: 0, leading to deadlocks in the whole node drain and reboot procedure.

At our workplace we have our microservice deployed in two different use cases:

  1. A few big installations, each handling lots of traffic. Here we can just use replicas: 2 or more, and there is no issue with PDB and evictions.
  2. Many small installations, each handling a small amount of spiky traffic. Here we can lower the resource limits and requests, but the nature of our service results in very spiky usage, so we can't shrink the resources too much. Instead we want to scale down to replicas: 1 to reduce the resource consumption of running a lot of these smaller installations.

This issue is only about use case 2 here.

Proposed solution: use kubectl rollout restart

The kubectl rollout restart command basically just adds an annotation to the pod template of the Deployment or StatefulSet resource, forcing it to roll out a new revision. This article explains a way around the eviction block by making use of this rollout restart approach: https://www.artur-rodrigues.com/tech/2023/03/30/impossible-kubectl-drains.html
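
For reference, here's roughly what that patch looks like with client-go. This is just a minimal sketch (the function name is mine, not existing kured code); the kubectl.kubernetes.io/restartedAt annotation is the same one kubectl sets:

```go
package restart

import (
	"context"
	"encoding/json"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// rolloutRestartDeployment mimics `kubectl rollout restart deployment/<name>`:
// it patches the pod template with a restartedAt annotation, which makes the
// Deployment controller roll out a new ReplicaSet while honouring the
// configured maxSurge/maxUnavailable, so the old pod is only removed once the
// new one is ready.
func rolloutRestartDeployment(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"template": map[string]interface{}{
				"metadata": map[string]interface{}{
					"annotations": map[string]string{
						"kubectl.kubernetes.io/restartedAt": time.Now().Format(time.RFC3339),
					},
				},
			},
		},
	}
	data, err := json.Marshal(patch)
	if err != nil {
		return err
	}
	_, err = client.AppsV1().Deployments(namespace).Patch(ctx, name,
		types.StrategicMergePatchType, data, metav1.PatchOptions{})
	return err
}
```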

What I envision that kured could do is:

  • if the pod has a PDB with .status.disruptionsAllowed: 0, and is owned by a Deployment with .spec.replicas: 1, then issue a rollout restart instead of an eviction
  • or if the pod has the annotation kured.dev/strategy: rollout-restart, and is owned by a Deployment, then always use the rollout restart approach on it
  • else, fall back to regular evictions, same as before

The first bullet point would be a "heuristic-based approach" that should solve a lot of people's issues without being too intrusive.

A rollout restart would restart all of the pods in the Deployment, so it's not a given that the user wants this when replicas > 1; for those cases the annotation provides opt-in behavior.
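
To make the proposed ordering concrete, here's a rough sketch of the decision logic (the Action type, the function, and the overall shape are hypothetical, not existing kured code; only the annotation key and the fields checked come from the bullet points above):

```go
package restart

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
)

// Action is a hypothetical enum for what kured would do with a pod during a drain.
type Action int

const (
	ActionEvict Action = iota
	ActionRolloutRestart
)

// decideAction sketches the ordering proposed above. The owner and pdb
// arguments are assumed to have been looked up already (nil when absent).
func decideAction(pod *corev1.Pod, owner *appsv1.Deployment, pdb *policyv1.PodDisruptionBudget) Action {
	// Explicit opt-in via the annotation always wins for Deployment-owned pods.
	if owner != nil && pod.Annotations["kured.dev/strategy"] == "rollout-restart" {
		return ActionRolloutRestart
	}
	// Heuristic: single-replica Deployment whose PDB currently allows no disruptions.
	if owner != nil && pdb != nil &&
		pdb.Status.DisruptionsAllowed == 0 &&
		owner.Spec.Replicas != nil && *owner.Spec.Replicas == 1 {
		return ActionRolloutRestart
	}
	// Everything else keeps today's behaviour: a regular eviction.
	return ActionEvict
}
```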

@ckotzbauer
Member

Thanks for opening this issue.
Yes, this is a widely discussed topic and we have had several issues about it in the past for kured.
That's an interesting approach which would solve several challenges related to this. However, I'm not sure it would fit all StatefulSet cases, depending on the rollout strategy defined.

@applejag
Author

Yeah, this would practically only help when you have single-replica Deployments.

Single-replica StatefulSets would not really benefit from this, and in that case a PDB makes even less sense, as you can't configure a StatefulSet to have a "surge" of temporarily increased pod count.

Maybe an edge case would be if you used the OpenKruise CloneSet with max surge, which is practically an extended Deployment but where you can make use of things like per-pod PVCs. Which brings up a big downside of this approach: it's not very portable.

Should kured only have a special case for Deployment resources? Is there a way to design it so it becomes agnostic? Should kured have some special cases implemented for CustomResources like the CloneSet from OpenKruise that I mentioned earlier?

In my tunnel-visioned opinion, I think only having a special case for Deployment is fine :) Kind of because that's a solution that would help me. But as this seems to be a common thing, maybe that's fine?

There are some Kubernetes proposals on improving this whole situation, such as this one: kubernetes/enhancements#4213
But it may take a while before the proposal is accepted, then about a year before we start seeing it in stable releases, and then a bit longer because most Kubernetes providers intentionally stay a minor version or more behind the latest for stability reasons. On top of this, the proposal might also be rejected, and then we would have to start all over again with a new proposal, which takes a lot of work.

Which weighs further in favor of my argument that a "hardcoded fix"/"special behavior for Deployments" might be sufficient for now. Wdyt?

Are you up for the idea? And if so, would you accept a PR on it?

github-actions bot

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@applejag
Author

Still relevant


github-actions bot commented Sep 9, 2024

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@applejag
Author

applejag commented Sep 9, 2024

Still relevant

@evrardjp
Collaborator

evrardjp commented Oct 19, 2024

Do you really want to grant kured cluster-wide access to patch Deployments, StatefulSets, and DaemonSets, instead of just raising the minimum replica count to 2?

That does not sound like a good trajectory to me. We should aim for least privilege.
Kured should have fewer rights, not more.

I can however think of two ways this could be achieved:

It could be achieved with limited rights with the help of another operator.
We could extend kured's RBAC a bit to add rights on ConfigMaps, with each ConfigMap holding the information about the node. (We would reduce kured's rights on its DaemonSet to counterbalance the added rights, and improve security at the same time.) A new kind of drain could write into such a ConfigMap instead of actually draining, which could then be used by stakater/reloader to do its work. However, there is no way to know whether the rollout restart went well, so we would need to rely on timeouts and delays. Side note: we are checking the blockers before draining (not after).
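
As a very rough sketch of what that could look like (the ConfigMap naming, keys, and function are purely hypothetical), the "soft drain" would boil down to something like:

```go
package softdrain

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markNodeForReboot records the pending reboot in a per-node ConfigMap instead
// of evicting pods. A workload that references this ConfigMap and carries the
// stakater/reloader annotations would then be rolled by Reloader, outside of
// kured's own RBAC. Everything here is hypothetical naming.
func markNodeForReboot(ctx context.Context, client kubernetes.Interface, namespace, nodeName string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: "kured-drain-" + nodeName, Namespace: namespace},
		Data: map[string]string{
			"node":        nodeName,
			"requestedAt": time.Now().Format(time.RFC3339),
		},
	}
	// Create the ConfigMap, or update it if it already exists from a previous reboot.
	_, err := client.CoreV1().ConfigMaps(namespace).Create(ctx, cm, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		_, err = client.CoreV1().ConfigMaps(namespace).Update(ctx, cm, metav1.UpdateOptions{})
	}
	return err
}
```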

Alternatively, the code could be modified to have a different way to drain, introducing this feature at the expense of security.

Overall, with the current state of kured, I am not really fond of the approach: all these changes and decreased security because someone with a microservice doesn't want to deploy 2 pods doesn't sound acceptable.

If we had a mechanism to alter the ordering of the actions based on user input, that might be easier, but it's not part of the code as of today.

@evrardjp
Collaborator

If you have other ideas, I am all ears :)

@applejag
Author

Yeah, the idea doesn't sit completely well with me either. Especially since this is not a generic solution, as it only applies to Deployments. It does not cover StatefulSets or DaemonSets. It could be considered "too domain specific".

because someone with a microservice doesn't want to deploy 2 pods doesn't sound acceptable

I feel the need to defend this a little. Especially so that it's understood what my motivation is.

The use case is scaling down resources where we have many identical tenants/environments that we want to be as small as possible, so that we can fit as many as possible on the same nodes. And scaling down CPU/memory resources vs scaling down replicas does not behave the same (unless you have an uncommonly steady load with no spikes whatsoever).

The argument I see often is "well, if you only have 1 replica then you're already stating that you're OK with downtime". Well, not exactly, because planned downtime and unplanned downtime are two very different things and happen at two very different intervals. I don't want downtime during planned maintenance, but I'm OK with occasional unplanned downtime.

Another adjacent use case is when a (micro)service is not designed for multiple replicas, i.e. when the job/task can't run in parallel. You then need to figure out how to do some kind of distributed lock via something like Redis, but then you add Redis to the stack when the idea was to scale down. Some concrete examples of this are a "garbage collector"/cleanup type of service (e.g. "scan the database for old records and edit or delete them"), or a service that scans something (e.g. rows in a database table) to then trigger alerts/webhooks based on it (you wouldn't want duplicate alerts just because you have 2 replicas). More generally, it's when 2 replicas would result in either race conditions or duplicate work.


But back on topic.

Your idea of having a separate operator for this might fit well. For example, an operator that registers an admission webhook for when a pod of a Deployment is being deleted/evicted and then does this "rollout restart" approach instead. Such a tool would not need to be related to kured at all, and could then more easily be installed in a select few namespaces.
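
As a very rough sketch of that hypothetical operator (the endpoint, names, and behaviour are all made up; a real version would look up the pod's owner and only intercept single-replica Deployments), a webhook handler for the pods/eviction subresource could look something like:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// handleEviction is a hypothetical webhook endpoint registered for CREATE on
// the pods/eviction subresource. This sketch rejects every eviction; a real
// operator would also enqueue a rollout restart of the owning Deployment.
func handleEviction(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	review.Response = &admissionv1.AdmissionResponse{
		UID:     review.Request.UID,
		Allowed: false,
		Result: &metav1.Status{
			Message: "eviction intercepted; a rollout restart will be issued instead",
		},
	}
	review.Request = nil
	if err := json.NewEncoder(w).Encode(&review); err != nil {
		log.Printf("failed to write admission response: %v", err)
	}
}

func main() {
	http.HandleFunc("/evictions", handleEviction)
	// Admission webhooks require TLS; the cert paths here are placeholders.
	log.Fatal(http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil))
}
```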


At our workplace we had a very small selection of services where we used a single replica with a PDB and where we felt we needed a solution for this. We were running "unattended upgrades + kured" on a monthly schedule, which meant that our nodes were restarted once a month, so this issue affected us often. However, we've since reconsidered the impact and either increased the replica count to 2 or removed the PDB (accepting the brief monthly downtime) on a per-service basis.

An alternative solution to scaling down that we've discussed at our workplace is making these services more "multi-tenant capable": instead of having the same service duplicated in a lot of tenants/environments, we would only have it once. Then even with 2 (or even 3) replicas of that new multi-tenant service, it still uses fewer resources than 20 single-replica, tenant/environment-specific (i.e. non-multi-tenant, one per tenant) instances.

From my end I'm OK with you closing this issue now as "Not planned". If the need comes back to us in the future, then we might consider writing the hypothetical operator we talked about here in this thread. But I wouldn't hold my breath.
