
Doing a Kubernetes upgrade if JupyterHub is running will stall the Kubernetes upgrade #1575

Closed
kmhughes opened this issue Feb 13, 2020 · 11 comments

Comments

@kmhughes

I have installed the JupyterHub chart 0.8.2.

I was doing an automatic upgrade of my Kubernetes cluster on GKE. Eventually the upgrade stalled. After some research I found that the node running the JupyterHub hub pod was not updating: the Pod Disruption Budget for the hub pod requires a minimum of 1 instance of the pod, so the single instance of the hub could not be shut down, because that would leave 0 instances of the hub.

The deployment.yaml template for the hub only allows a single instance of the hub (replicas: 1), and there is no way to specify the desired number of replicas in the values.yaml file.

jupyterhub/templates/hub/deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hub
  labels:
    {{- include "jupyterhub.labels" . | nindent 4 }}
spec:
  replicas: 1  # <= NOTICE: cannot be changed

This means that if you enable a pod disruption budget, you cannot drain the node running the hub pod unless the budget allows a minimum of 0 available pods.
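For reference, the budget that blocks the drain looks roughly like this (a sketch based on chart 0.8.2; the exact API version and labels in your cluster may differ):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: hub
spec:
  minAvailable: 1          # with only 1 hub replica, no voluntary eviction is ever allowed
  selector:
    matchLabels:
      component: hub

With a single replica and minAvailable: 1, the eviction request issued by the node drain is rejected indefinitely, which is exactly the stall I saw during the GKE upgrade.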

Please make the number of replicas of the hub pod configurable in the values.yaml file. You might also want to document that enabling a pod disruption budget with only 1 replica will lead to this problem.
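For example, something along these lines in the deployment template (just a sketch; hub.replicas is a value name I am proposing, not one that exists today):

spec:
  replicas: {{ .Values.hub.replicas | default 1 }}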

I have not checked if the other pods have similar issues.

@manics manics transferred this issue from jupyterhub/helm-chart Feb 14, 2020
@manics
Member

manics commented Feb 14, 2020

This sounds like a valid use case. I've transferred this issue to the zero-to-jupyterhub repo, which is where development happens. https://github.com/jupyterhub/helm-chart is for storing and publishing the charts.

@kmhughes
Author

kmhughes commented Feb 14, 2020

@manics Thank you!

@kmhughes
Author

If you like, I could create a pull request with the change.

@betatim
Member

betatim commented Feb 14, 2020

Right now you cannot run more than one JupyterHub pod at a time because of shared state. Several people are interested in changing this, but it is a significant project, so it will take some time.

In the meantime I think the thing to do is to document that automatic upgrades, or other operations that require automatic pod relocation, won't work, and what to do in this situation. I think it is better to document this and tell users that they need to explicitly delete the pod. That way the admin has full control over when the brief interruption to users happens, by choosing the moment at which they delete the pod.
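The documented steps could be roughly the following (the namespace and label selector here are only illustrative; adjust them to your deployment):

# find the hub pod and the node it is running on
kubectl get pod -n jhub -l component=hub -o wide

# delete the hub pod yourself; the Deployment recreates it on another node
# and the stalled drain can then proceed
kubectl delete pod -n jhub -l component=hub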

@kmhughes
Author

OK, good to know. I may disable the pod disruption budget for now.

Definitely an important thing to document: I waited for over an hour for that node to drain before I figured out something was wrong. I haven't upgraded the cluster that often, so I didn't know whether to expect a lot of variability in drain times. Once JupyterHub was down, the remaining nodes drained like clockwork.
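For reference, I believe the chart exposes this as something like the following in the Helm config (the exact keys may vary between chart versions, so check your chart's values.yaml):

hub:
  pdb:
    enabled: false
proxy:
  pdb:
    enabled: false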

@betatim
Member

betatim commented Feb 17, 2020

Agreed on the documentation as it is a pitfall/source of frustration/confusion. Do you want to open a PR to add this?

@kmhughes
Author

kmhughes commented Feb 17, 2020 via email

@consideRatio
Member

consideRatio commented Oct 7, 2020

I'm starting to think that we should disable our PDBs by default unless we have two replicas, like we do for our user-scheduler pods. I don't see when a PDB does more good than harm for the hub, proxy, or autohttps pods at the moment.

@betatim what do you think at this point?

@consideRatio
Member

I don't see a clear win in making the hub replica count configurable: exposing it may suggest that increasing it makes sense, while running multiple replicas would typically lead to non-obvious runtime errors. And to temporarily set it to zero, kubectl edit deploy/hub is quicker than a helm upgrade, I figure.

Unless I see a strong benefit of a PDB for the hub/proxy/autohttps pods, I think they should be allowed to be disrupted during upgrades etc.; I find that easier than having to delete the pods manually. Anyone doing a k8s version upgrade of a JupyterHub deployment should be aware that it will cause disruptions, since ours is not an HA helm chart, so I'd say it's better to just disrupt quickly, without fuss.
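(A kubectl scale one-liner does the same thing as editing the replicas field by hand; the jhub namespace below is only an example:)

# temporarily remove the hub pod so that a node can be drained
kubectl scale deploy/hub --replicas=0 -n jhub

# bring it back once the node operation is done
kubectl scale deploy/hub --replicas=1 -n jhub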

@consideRatio
Member

Referencing a quote from #1649 (comment)

I also would like to have a simple active/passive based failover model so I can automate upgrades of nodes and clusters without having to modify the PDB or force evict the hub pod.

@consideRatio
Member

consideRatio commented Feb 16, 2021

This was closed by a change to the PDBs, together with #1934, in #1938!
