
Deployments with 1 replica and a PodDisruptionBudget enabled by default prevent node drain operations from succeeding #1934

Closed
DaniJG opened this issue Dec 7, 2020 · 26 comments · Fixed by #1938

@DaniJG

DaniJG commented Dec 7, 2020

Bug description

Installing the chart with the default values results in workloads such as the hub being created with a hardcoded replicas: 1 and a PodDisruptionBudget with minAvailable: 1.

kubectl get poddisruptionbudget --all-namespaces
NAMESPACE    NAME               MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
jupyterhub   hub                1               N/A               0                     191d
jupyterhub   proxy              1               N/A               0                     191d
jupyterhub   user-placeholder   0               N/A               0                     191d
jupyterhub   user-scheduler     1               N/A               0    
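For context, the hub's PDB rendered by the chart corresponds roughly to a manifest like the following. This is only a sketch reconstructed from the output above; the API version, resource name, and selector labels are assumptions and should be checked against the chart templates.

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: hub
  namespace: jupyterhub
spec:
  minAvailable: 1        # with a single replica, this leaves 0 allowed disruptions
  selector:
    matchLabels:
      component: hub     # assumed label; check the chart templates for the exact selector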

This prevents operations such as draining the node with kured (see kubereboot/kured#82) which require the workload to be redeployed to a different node, killing that single replica in the process. Any operation that needs to kill that single replica will conflict with the PodDisruptionBudget that is defined by default.

You can see the following error repeated over and over in kured's output:

time="2020-12-07T11:59:35Z" level=info msg="evicting pod jupyterhub/hub-6dddf4fdc7-mv8jg" cmd=/usr/bin/kubectl std=out
time="2020-12-07T11:59:35Z" level=warning msg="error when evicting pod \"hub-6dddf4fdc7-mv8jg\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget." cmd=/usr/bin/kubectl std=err
time="2020-12-07T11:59:40Z" level=info msg="evicting pod jupyterhub/hub-6dddf4fdc7-mv8jg" cmd=/usr/bin/kubectl std=out
time="2020-12-07T11:59:40Z" level=warning msg="error when evicting pod \"hub-6dddf4fdc7-mv8jg\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget." cmd=/usr/bin/kubectl std=err

The only way to allow such workloads to be redeployed, and automated processes such as node drains or kured reboots to proceed, is to manually remove the PDB. (See https://kubernetes.io/docs/tasks/run-application/configure-pdb/#think-about-how-your-application-reacts-to-disruptions)

Expected behaviour

I would expect the PodDisruptionBudgets to be disabled by default (or to include minAvailable: 0 by default). This way, installing JupyterHub with its default values will not prevent nodes from being drained (or similar operations that might require redeploying the workload).

Actual behaviour

By default, the chart deploys both the hub and proxy workloads with one replica and a PodDisruptionBudget with minAvailable: 1. This blocks maintenance operations like a kured-controlled reboot (which needs to drain the node), since any operation that evicts that single pod will fail.

How to reproduce

Follow these steps:

  1. Install and start minikube
  2. Install kured using its official chart
    $ helm repo add kured https://weaveworks.github.io/kured
    $ helm install kured kured/kured --set configuration.period=1m
    
  3. Install jupyter using this chart
    $ helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
    $ helm repo update
    $ helm install jupyter jupyterhub/jupyterhub --set proxy.secretToken=$(openssl rand -hex 32)
    
  4. Run minikube ssh
  5. Run sudo touch /var/run/reboot-required
  6. Check the logs of the kured container. Note how the drain operation is consistently being attempted and failing due to the default PodDisruptionBudget
    $ kubectl logs kured-8ndbm --tail 10 -f
    ...
    error when evicting pod "proxy-569d78d49c-wc6db" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    evicting pod default/user-scheduler-8664cb5984-hctvs
    error when evicting pod "user-scheduler-8664cb5984-hctvs" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    ...
    

If you want, clean up your minikube environment with:

$ helm delete kured
$ helm delete jupyter

Your personal set up

Minikube:

$ minikube version
minikube version: v1.15.1

Jupyter chart

$ helm show chart jupyterhub/jupyterhub
apiVersion: v1
appVersion: 1.2.2
description: Multi-user Jupyter installation
home: https://z2jh.jupyter.org
icon: https://jupyter.org/assets/hublogo.svg
kubeVersion: '>=1.14.0-0'
maintainers:
- email: [email protected]
  name: Erik Sundell
- name: Simon Li
  url: https://github.com/manics/
name: jupyterhub
sources:
- https://github.com/jupyterhub/zero-to-jupyterhub-k8s
version: 0.10.6

kured chart

$ helm show chart kured/kured
apiVersion: v1
appVersion: 1.5.1
description: A Helm chart for kured
home: https://github.com/weaveworks/kured
icon: https://raw.githubusercontent.com/weaveworks/kured/master/img/logo.png
maintainers:
- email: [email protected]
  name: ckotzbauer
- email: [email protected]
  name: davidkarlsen
name: kured
sources:
- https://github.com/weaveworks/kured
version: 2.2.1
@DaniJG DaniJG added the bug label Dec 7, 2020
@welcome

welcome bot commented Dec 7, 2020

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

@betatim
Member

betatim commented Dec 8, 2020

Stopping the hub pod will cause downtime: because there can only be one of them, the first pod (probably) has to stop completely before the new one can be started, and it can take "a long time" (seconds?) for a new hub to start up. This means that, as a hub administrator, you don't want your hub pod getting restarted unless you say so.

I don't know the history of the current setup, but because of the above I could imagine that it is set up like this on purpose.

Because the answer to "do I care if my hub pod gets restarted at arbitrary times" is "it depends", I think making the default be "make the admin do something, and maybe they'll read the docs to understand why they are being asked to do something" would be a good strategy. So the question is: do we have docs people can find and read when they wonder "why is this hub pod special and getting in the way?" If not, maybe we should create those instead of changing the PDB.

The docs would explain why the default is what it is, what it means for the hub pod to restart, maybe give a recommendation for cases when you might not care about the downsides and a recipe for how to configure things so that the hub doesn't get in the way of auto scaling events (or alternative setups).

What do you think?

@consideRatio
Member

consideRatio commented Dec 8, 2020

Thank you so much for such a thorough issue report @DaniJG ❤️ 🎉!

Discussion

This is also discussed in #1575. I lean towards agreeing that this is unwanted default behavior since we don't support HA hub/proxy/autohttps pods, and that we should not create PDBs by default for our 1-replica Deployment resources.

@betatim @yuvipanda @minrk @manics, do you know of a sensible reason to have a PDB for a Deployment resource like the hub, proxy, and autohttps pods in this Helm chart? I can't think of one any more; I consider a PDB to make sense only if you have a Deployment with multiple replicas.

I opined on this in the linked issue and still hold the opinion that...

Unless I see a strong benefit of a PDB for the hub/proxy/autohttps pod, I think they should be allowed to be disrupted during [k8s] upgrades etc. I find that easier than needing to manually delete the pods. Anyone making a k8s version upgrade of a JupyterHub deployment should be aware that it will cause disruptions since we are not an HA Helm chart, so I'd say it's better to just disrupt quickly without fuss.

History

The PDBs for hub/proxy were added here: #394. I believe the idea was that adding the PDB would attach a "please don't disrupt me" signal, but that signal now blocks kubectl drain <node> for example, which in my mind is unwanted.

I think the case for having a PDB for our hub/proxy pod (and the autohttps pod, which doesn't have one) is for situations when you have a k8s Deployment-based pod on a node that can be autoscaled away by relocating the pod to another node. Then, a PDB could block that, and the node wouldn't be scaled away. In my mind though, for our single-replica Deployment pods, this should be handled in advance by node pool planning: locate the single-replica Deployment pods on dedicated nodes that don't generally scale up and down. This is discussed in https://zero-to-jupyterhub.readthedocs.io/en/latest/administrator/optimization.html#scaling-down-efficiently for example.
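One way to express that node pool planning is to pin the single-replica core pods to a dedicated node pool that doesn't autoscale, for example via a nodeSelector on their pod templates. The label below follows the node-purpose pattern used in the z2jh docs but is an assumption here; use whatever label your core node pool actually carries.

# Fragment of the hub/proxy pod template spec (sketch)
nodeSelector:
  hub.jupyter.org/node-purpose: core   # assumed label for a dedicated, non-autoscaling core node pool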

Suggested action point

  • We set hub.pdb.enabled and proxy.pdb.enabled to false by default and release it as part of a minor Chart version bump.
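For chart users running into this before such a release, those same keys can already be set in the chart configuration. This is a sketch based on the config names mentioned above; verify against your chart version's values.

# config.yaml passed to helm install/upgrade with --values
hub:
  pdb:
    enabled: false
proxy:
  pdb:
    enabled: false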

Alternative action point

  • We provide a motivation for the current default value being what it is, and warn users about it ahead of time so they don't run into this issue unexpectedly.

@DaniJG
Author

DaniJG commented Dec 8, 2020

IMHO, without an HA setup the PDB is more disruptive than beneficial. By definition, a non-HA setup is expected to have some downtime (as little as that might be) during maintenance. Therefore, blocking maintenance operations as the default approach seems a bit too aggressive.

In certain cases, blocking those maintenance operations can have a more adverse effect than the small amount of downtime the PDB tried to avoid in the first place (for instance, it might prevent a critical security patch from being applied).

My vote would be for updating the default to false. And as long as the PDB option is there, it would be useful to update the docs so they explain when/why you might want to enable the PDB and the caveats you need to watch out for.

@manics
Member

manics commented Dec 8, 2020

I'm inclined to agree with @DaniJG; I can't see what benefit a PDB offers when you've only got a single pod. Requiring additional manual work to perform maintenance is a burden, and goes against the CI/CD/infrastructure-as-code paradigm.

I think we could perhaps improve our docs to make it clear what types of configuration changes may lead to downtime?

@betatim
Member

betatim commented Dec 10, 2020

As a hub operator I prefer having to manually intervene for things that cause (short) downtimes instead of having them happen at random times. This lets you choose when the disruption happens, so you can either announce it or perform it outside of core hours.

I don't think there is a way to help with this by planning your node pools carefully. Yes, you could make a "core" pool and locate the pods there, but then you will still have to manually perform maintenance on these core nodes, which defeats the goal of this being automatic.

I think making "how to make my hub/proxy/etc pod disruptable" discoverable and documented and keeping the default as "no disruption" is the way to go. This means novice users of k8s will not have "weird outages" that are hard to debug. Expert users will quickly notice that the PDBs are preventing their kured from performing maintenance or that kubectl drain didn't work. So if we can make it easy for them to then find some docs that explain why it is this way, give them a 👍 for "yes it is as easy as disabling the PDBs" and remind them what the consequences of this might be, I would go with this option.

Unfortunately the hub pod (and friends) are not cattle, they are pets. This is why I think you should be required to opt-in to them being treated like cattle (relocated, disrupted, etc). Either by calling kubectl delete pod ("opt-in" each time) or by disabling the PDBs ("one time opt-in").

@consideRatio
Member

I spent three hours investigating this in depth and have now opened #1937, where I make the case for deleting the PDBs, given that this chart does not support HA.

In short, it's because PDBs are meant for use with HA applications. We introduced PDBs for our non-HA applications with the intent to avoid disruptions caused by the cluster-autoscaler scaling down, I believe, but that could have been resolved in another way without the downsides raised in this issue, such as blocking kubectl drain indefinitely, which I assume nobody argues is a desired outcome.

I suggest we continue the discussion on what to do here rather than move to the PR.

@manics
Member

manics commented Dec 10, 2020

Thanks for the explanation on #1937. Based on what you've said I think removing them completely makes sense. I understand @betatim's point about it providing a way to prevent inadvertent downtime, but it sounds like we'd be misusing PDBs for this purpose, and the more we deviate from established K8S norms the more confusion it's likely to cause.

@chicocvenancio
Contributor

FWIW I'm with @betatim on this.

@consideRatio on #1937 (comment)
As an indication of this, see the k8s docs on "configure-pdb", which have a "Before you begin" section that says "You are the owner of an application running on a Kubernetes cluster that requires high availability."
ref: https://kubernetes.io/docs/tasks/run-application/configure-pdb/

True, but later on the same docs mention exactly the use case we have and provide the solution mentioned above.

"Decide how many instances can be down at the same time for a short period due to a voluntary disruption. (...)

  • Single-instance Stateful Application:
    • Concern: do not terminate this application without talking to me.
      • Possible Solution 1: Do not use a PDB and tolerate occasional downtime.
      • Possible Solution 2: Set PDB with maxUnavailable=0. Have an understanding (outside of Kubernetes) that the cluster operator needs to consult you before termination. When the cluster operator contacts you, prepare for downtime, and then delete the PDB to indicate readiness for disruption. Recreate afterwards."

ref: https://kubernetes.io/docs/tasks/run-application/configure-pdb/#think-about-how-your-application-reacts-to-disruptions

While having an option without a PDB seems sensible, and docs would be useful, removing the PDB entirely seems counterproductive to me.
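For reference, "Possible Solution 2" quoted above corresponds to a PDB of roughly this shape. This is only a sketch; the resource name and selector labels are placeholders and must match the actual pod labels.

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: hub            # placeholder name
spec:
  maxUnavailable: 0    # blocks all voluntary disruptions until the PDB is deleted
  selector:
    matchLabels:
      component: hub   # placeholder selector; must match the pod's labels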

@crookedstorm

Cluster operator fly-on-the-wall here.

@chicocvenancio Is the hub truly a stateful application, though? I don't think this principle really applies unless it is, and while there is some state there (logins?), it doesn't meet the same definition as a state machine or database would. I might be wrong, and am genuinely asking.

Draining nodes is essential to minimizing downtime in clusters. The PDB did not block deleting the pod on my last cluster upgrade. It only blocked draining, which was frustrating and served no useful purpose that I could find. I had to open another terminal and delete the pod to proceed. I personally favor removing PDB unless we are using it for persistent state or for an HA application.

@manics
Member

manics commented Dec 10, 2020

Could the differing views be due in part to the relationship between a k8s cluster and JupyterHub?

BinderHub and the zero2jupyterhub guide mostly assume JupyterHub runs on its own dedicated cluster, and that the k8s admin is also responsible for all applications. From this perspective, requiring a manual override for updates is fine, as they'll be familiar with JupyterHub.

On the other hand if you have a large k8s cluster shared between multiple groups, for instance segregated by namespace, the k8s admins most likely won't be familiar with the applications that are running. Depending on the urgency of the cluster maintenance work this may mean calling the JupyterHub admin out of hours, or forcibly killing the pod anyway if it's now blocking other applications.

@consideRatio
Member

Is the hub truly a stateful application, though?

@crookedstorm yes. I don't know all the details, but yes. I don't remember where best to link you for more information at this point, but @minrk has provided a good overview in either this repo or the jupyterhub repo in an issue/PR or two, I believe. I'm having a go at summarizing the statefulness of the various pods anyhow.

Various bugs will follow from trying to run hub/proxy/autohttps pods in parallel. JupyterHub itself has assumptions that break when run in parallel due to internal state. The proxy pod runs a NodeJS application called ConfigurableHttpProxy, and JupyterHub assumes there is only one such proxy; if there were two, it would configure routes to users on only one of them, not all. Finally, the autohttps pod will more often than not fail if run in parallel with another when it comes to the initial automatic TLS certificate acquisition and later renewal, because a challenge may be initiated from one instance while the incoming challenge request is directed to the other.


@chicocvenancio can you elaborate and be a bit more explicit about your use case for the PDBs? I'm not sure what use case you have for the PDBs of the hub/proxy pods. Let's try to make the cases more explicit so it becomes easier to reason about it all.

Use cases for PDBs of hub/proxy pod

Use case 1 - to make cluster admin ask for approval first

This use case was described in the k8s docs.

Possible Solution 2: Set PDB with maxUnavailable=0. Have an understanding (outside of Kubernetes) that the cluster operator needs to consult you before termination. When the cluster operator contacts you, prepare for downtime, and then delete the PDB to indicate readiness for disruption. Recreate afterwards.

Use case 2 - to block down-scaling by cluster-autoscaler

For this case specifically, if it is wanted, I think an annotation should be applied to block the cluster-autoscaler from relocating the pod instead. Using a PDB to block the cluster-autoscaler has other outcomes as well, such as not being able to use kubectl drain, which will get stuck and require manual intervention.
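A minimal sketch of that annotation approach, applied in the Deployment's pod template metadata; how to set this through the chart's values is not shown here and would need checking.

# Fragment of the Deployment's pod template (sketch)
metadata:
  annotations:
    # Tells the cluster-autoscaler not to evict this pod when scaling a node down,
    # without otherwise blocking kubectl drain the way a restrictive PDB does.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"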

Use case 3 - ???

@chicocvenancio
Contributor

I guess it can count as use case 1, only the cluster admin and the JupyterHub admin might be the same person. While we have TLJH now, I still think of this chart and its defaults as a user-friendly, as-foolproof-as-possible guide to setting up JupyterHub on Kubernetes. Leaving the hub pod susceptible to voluntary disruptions seems to go against that.

@DaniJG
Author

DaniJG commented Dec 10, 2020

For additional context, I encountered this issue in a situation similar to what @manics describes, except that the k8s admins are looking after several clusters with dozens of teams and projects.

  • Kured is being used to precisely control when each cluster can support the necessary disruption to apply OS patches, and to minimize that disruption.
  • Each cluster can define its own schedule where nodes are allowed to be rebooted, including delaying based on labels/prometheus.
  • When necessary, kured performs a graceful reboot one node at a time, following a cordon-drain-reboot process.

I see the reasoning behind having admins ask for approval if downtime is needed. However, the same reasoning can be made in reverse: you could argue that admins expect anyone wanting to avoid maintenance downtime to use an HA setup, rather than blocking the maintenance.

Both arguments have merit, and I am sure different folks/teams will prefer one or the other depending on their context and use case. However, only one can be the default in the chart. IMHO the chart needs to choose between either:

  • as an owner of the jupyterhub chart, you assume you will have downtime during maintenance, since it's not HA. You read this in the docs and plan accordingly with your admins, hopefully using tools like kured
  • as an owner of the jupyterhub chart, you assume no maintenance can happen without your approval, to avoid disrupting your non-HA app. Unless you told them about this (in which case they will ask you to remove the PDB), admins eventually find out about it during maintenance and ask you to remove the PDB

It's a little skewed, but if the admin and app owner are different persons or teams, it won't be far from reality.

@consideRatio
Member

consideRatio commented Dec 10, 2020

Thanks for the discussion!

I don't feel there is enough agreement to go forward with deleting the non-HA associated PDBs, so I'll rework my PR to not delete the PDBs in question, but to make them disabled by default.

In order to truly support the case of choosing to accept blocking of kubectl drain, I'm also adding a disabled-by-default PDB for the autohttps deployment. If not, it would be disrupted while the other pods still block kubectl drain, resulting in the worst of both compromises.

Since the autohttps deployment gets a PDB of its own, I'm also relocating the configuration of the proxy Deployment's PDB from proxy.pdb to proxy.chp.pdb (proxy pod) and creating a new configuration at proxy.traefik.pdb (autohttps pod). This is done to align with how other configuration is done for these pods.
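Under that scheme, the relevant chart values would be structured roughly as follows. This is a sketch based on the names described above; the exact keys and defaults should be checked against #1938.

hub:
  pdb:
    enabled: false         # hub PDB, proposed to be disabled by default
proxy:
  chp:
    pdb:
      enabled: false       # PDB for the proxy (ConfigurableHttpProxy) pod
  traefik:
    pdb:
      enabled: false       # new PDB for the autohttps (Traefik) pod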

@consideRatio
Member

I've created #1938. Like before, let's continue the discussion about what to do here, and keep discussion in the PR about lines of code etc.

@consideRatio consideRatio added this to the 0.11.0 milestone Dec 11, 2020
@betatim
Member

betatim commented Dec 13, 2020

I agree that if your cluster operator isn't also a hub admin/owner they will be annoyed by this behaviour. No matter what we do here, I'd want to talk to my cluster admins to discuss the fact that my hub isn't HA and that it needs some care when restarting stuff: either by agreeing on an acceptable maintenance window, or that they should call me, or that (in an emergency) they should just go ahead.

I also think that the default should be to assume that a z2jh chart user is someone with little k8s experience and little k8s support. This means either keeping the PDBs as is or making the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation the default. In addition, there should be a config option to disable the PDBs for the situation where you do have more experience and/or the cluster is operated by someone else with more experience.

It seems the potential for time spent going in circles is much larger for inexperienced users if the PDBs are removed. In the situation where you have more experience or an admin team, the outcome would be that you install the hub chart, don't talk to anyone about it (because you don't know about the PDB), get paged once at some crazy hour, and then get told by your admin team that your PDBs are preventing maintenance. It might still take some time for you to work out what to do next, but the problem has been pinpointed, and with more docs targeted at exactly this situation Z2JH could help reduce the time to resolution.

For me the two cases are a bit like (1) an executable randomly crashing under load when it runs out of memory or (2) crashing with an explicit error message that it couldn't allocate more memory. In (1) you don't have anything to google, and in (2) we can have a page ready: "so your app is crashing because it can't allocate new memory". This isn't the best simile; the point I am trying to make is that we should make it easy to google for solutions when things go wrong, which means avoiding "random or hard to correlate behaviour" in favour of errors/crashes/getting stuck with an explicit error message.

Question: if we kill the CHP and hub pod at the same time, is that worse in terms of recovery time than if only one goes down? I know the hub recovers some of its state from talking to the proxy, or vice versa.

@consideRatio
Member

I also think that the default should be to assume that a z2jh chart user is someone with little k8s experience and little k8s support. This means either keeping the PDBs as is or making the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation the default.

The default of blocking evictions from the autoscaler and from kubectl drain in general is what I consider to be an advanced optimization, which is only sometimes wanted.

@betatim what's easier for an admin with little experience, in my mind, would be to not make kubectl drain / the autoscaler fail by default. It is also what most Helm charts default to, at least all I've seen so far for 1-replica charts, and that makes it the expected behavior for experienced admins as well, I would say.

The question you raised is relevant and merits mitigation if needed no matter what, but I would suggest we mitigate it independently of this choice, for example with a preferred pod anti-affinity between proxy/hub or perhaps grace period differences etc. It would still be a motivation for not disabling the PDBs by default if a simultaneous shutdown of the pods would be more disruptive, in the case where there is only one node for the core pods in the cluster, though.

I recognize this is a compromise no matter what, but I'm very strongly in favor of disabling them by default, with the core motivation documented in #1938.

@manics
Member

manics commented Dec 14, 2020

I think one of the strongest arguments for not enabling the PDB by default is that it goes against what other Helm charts do. As a sysadmin or developer I find it incredibly frustrating when I run into a problem or bug which is due to a library or other application going against established norms. The designer of that library/application may well have had a good reason for that choice, but that reasoning is often only known by existing users who are very familiar with it.

This doesn't mean a PDB can never be enabled by default, but I think the barrier to doing so should be higher than simply asking "is it more convenient for novice users or for existing sysadmins". Instead I'd ask "is it so much better for users that it justifies going against the norms expected by most K8S admins, and the potential inconvenience to those people". As much as I like JupyterHub, I don't think it does.

Rather than putting a novice user in a position where they can expect an angry call in the middle of the night from a sysadmin (which, if they're a novice user, will probably come as a complete surprise and they won't know how to respond) I think it's better to fit in with the whole K8S ecosystem.

@consideRatio
Member

consideRatio commented Dec 16, 2020

When using GKE to upgrade a node pool, it says the following:

Upgrading nodes in the node pool. A single node upgrade typically takes 4-5 minutes, but may be slower (e.g., due to pod disruption budget or grace period). Learn more


I note how they say may be slower rather than may fail, which it would with our default configuration of JupyterHub.


Let's discuss this issue/PR in the team meeting tomorrow so it doesn't go stale.

@yuvipanda
Collaborator

yuvipanda commented Dec 17, 2020

fwiw, I use this alias to drain nodes:

# Drain a node that is probably running jupyter things
k-drain() {
    # We always wanna ignore daemonsets, duh
    # Disabling evictions lets us ignore PodDisruptionBudgets
    # delete-local-data lets us get rid of pods with emptyDirs
    # force is needed to delete user pods - they aren't managed by anything else
    kubectl drain --ignore-daemonsets --disable-eviction --delete-local-data --force $@
}

@consideRatio
Member

consideRatio commented Dec 20, 2020

@yuvipanda oh, I had not learned about the --disable-eviction flag!

      --disable-eviction=false: Force drain to use delete, even if eviction is supported. This will bypass checking
PodDisruptionBudgets, use with caution.

      --force=false: Continue even if there are pods not managed by a ReplicationController, ReplicaSet, Job, DaemonSet
or StatefulSet.

By bypassing PDBs, we may terminate multiple pods of an HA deployment all at once, causing downtime for that HA deployment. In the case of the nginx-ingress-controllers that are part of mybinder.org-deploy or similar, this would mean a major disruption for everything in the k8s cluster.

I think that by not shipping the maximally restrictive PDB as the default for our single-replica deployments, you would not need the --disable-eviction flag to successfully drain nodes, and PDBs could still be respected where they serve an important purpose: slowing down the drain of HA deployments to a pace that helps them remain HA.

@consideRatio
Member

I opened #1951 to represent the wish to have better documentation about HA as part of this change of PDB defaults.

@yuvipanda
Collaborator

yeah, drain will always have disruptions for that node - I treat it like the node is gonna die... To be truly HA, you'll need to make sure your pods are on different nodes, with podAntiAffinity.

To be safer, you can use the following:

# 2 minutes for anything HA with PDB to move to other nodes
kubectl drain --ignore-daemonsets --timeout=2m --delete-local-data --force $@
# After two minutes, just kill everything
kubectl drain --ignore-daemonsets --disable-eviction --delete-local-data --force $@

@consideRatio
Member

I guess there are levels of HA

  • One can be HA with respect to voluntary disruptions (which respect evictions) by having multiple replicas, even if they are on the same node, thanks to a PDB declaring that at least one needs to be available.
  • One can also be HA with respect to node crashes, or disruptions that don't respect PDBs, by having multiple replicas scheduled on separate nodes and enforcing that through a required pod anti-affinity (see the sketch below).
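A sketch of that second point, expressed as a required pod anti-affinity in a Deployment's pod spec; the label used to select the replicas is a placeholder.

# Fragment of a Deployment's pod spec (sketch, placeholder label)
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-ha-app                     # placeholder label selecting this app's replicas
        topologyKey: kubernetes.io/hostname    # forces replicas onto different nodes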

For reference, I recently investigated how we run mybinder.org-deploy, and concluded that we have systematically avoided having a PDB that allows 0 disruptions while also having only 1 replica.

hub, proxy, binderhub, matomo, analytics-publisher, federation-redirect, redirector, proxy-patches - none of these deployments in the mybinder.org Helm chart has opted for the 1 replica / 0 allowed disruptions setup that we defaulted to before #1938.

@yuvipanda
Collaborator

Despite my original reservations, based now on experience, I've come to conclude that removing the PDBs was the right thing to do. Thanks to everyone for making it happen!
