
Investigate pdb setups #202

Closed
yuvipanda opened this issue Feb 5, 2021 · 12 comments


@yuvipanda
Member

pdb has been disabled - we should file an issue to investigate if we want it - jupyterhub/zero-to-jupyterhub-k8s#1938

I'm strongly opinionated after having researched and considered this in depth.

I find there is no point enabling a PDB to enforce either one replica available (minAvailable: 1) or zero replicas unavailable (maxUnavailable: 0) on pods that don't support running two separate replicas - this was the old z2jh default.

Note also that mybinder.org-deploy has explicitly opted out of the old z2jh default, and it systematically avoids that kind of behavior for all its other single-replica deployments as well. This was also discussed in jupyterhub/mybinder.org-deploy#1730 (comment).

Originally posted by @consideRatio in #194 (comment)
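
For context, the old z2jh default rendered roughly the following kind of PodDisruptionBudget for the hub. This is a sketch only, with an illustrative name and label selector rather than the chart's exact output:

apiVersion: policy/v1beta1   # policy/v1 on Kubernetes 1.21 and later
kind: PodDisruptionBudget
metadata:
  name: hub
spec:
  minAvailable: 1            # "1 replica availability"; maxUnavailable: 0 has the same effect here
  selector:
    matchLabels:
      component: hub         # illustrative selector for the single hub pod
# With only one hub replica, either setting makes the API server deny every
# voluntary eviction (i.e. node drain) of that pod.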

@yuvipanda
Member Author

So the behavior I want is that no node containing a hub or proxy pod should ever be automatically deleted (by autoscaling, automatic upgrades, etc.) without manual intervention. @consideRatio do you know how we can accomplish that?

@consideRatio
Member

Assuming all automatic deletions respect PDBs (which only work by denying eviction requests), then you do it in the way I was strongly opinionated against ;)

But if you specifically want to stop automatic deletion caused by cluster downscaling (and not by manual k8s node upgrades or automated maintenance windows), then it is preferable to block the downscaling by adding an annotation so the cluster autoscaler doesn't try to drain a node with such a pod on it in the first place.

Btw, I'm not confident a maintenance-window upgrade will respect the PDBs indefinitely; perhaps it only tries for 10 minutes or so and then proceeds anyway, for example.
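
Concretely, the annotation approach means the hub/proxy pods carry pod metadata like the following (a sketch; the z2jh values for setting it are shown a couple of comments further down):

metadata:
  annotations:
    # The cluster autoscaler will not scale down a node hosting a pod with this annotation:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"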

@yuvipanda
Member Author

So I think the hub won't be able to run with multiple replicas for a long, long time :) Until then, I wanna make sure that there's no unannounced downtime for hubs, as much as possible. I do that with the following:

  1. Run a regional cluster, so master upgrades don't disrupt hubs
  2. Run a core node pool
  3. Turn off autoupgrade for the core node pool, so all upgrades are done manually
  4. Enable PDBs, so the core nodes don't get autoscaled down if they have hub or proxy pods on them. I'm OK with autoscaling them for other pods that might exist on them (Prometheus, Grafana, etc.). This seems to work across cloud providers without any configuration. A sketch of what this looks like in chart values is below.
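
A minimal sketch of step 4 in chart values, assuming z2jh's hub.pdb / proxy.pdb settings (the exact keys can differ between chart versions):

hub:
  pdb:
    enabled: true
    minAvailable: 1    # with a single hub replica, this blocks eviction of the hub pod

proxy:
  pdb:
    enabled: true
    minAvailable: 1    # likewise for the proxy pod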

What do you think of this goal, and the current steps for accomplishing it?

@consideRatio
Member

consideRatio commented Feb 5, 2021

@yuvipanda I just want to say that I really appreciate your part of the dialogue about this with me. You just keep finding very constructive paths onwards even though it can be tricky when someone is strongly opinionated about something. I really appreciate that!

The goal

I wanna make sure that there's no unannounced downtime for hubs, as much as possible.

Sounds good!

How to achieve the goal

I like steps 1-3, though I would perhaps be willing to compromise on step 1 for cost efficiency, as regional clusters cost more on GKE nowadays I think. Downtime on the k8s api-server isn't so crucial, since the hub will still respond and the proxy will keep working etc.; I think the only drawback is that user servers can't be spawned during the downtime.

But regarding the PDBs in point 4, I suggest replacing them with the strategy of annotating the hub/proxy pods so the cluster autoscaler won't scale down the nodes they run on. By doing that, we can still ask the cloud provider for a manual node pool upgrade without also needing to manually delete the hub/proxy pods whenever that manually invoked node pool upgrade gets stuck.

hub:
  annotations:
    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
    
proxy:
  annotations:
    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"

Here is the reference on the annotation:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node

@yuvipanda
Member Author

yuvipanda commented Feb 5, 2021

Glad to work through this, @consideRatio :) The long-term solution is to make things HA, of course...

Is that annotation supported by upstream autoscaler?

@consideRatio
Member

@yuvipanda whoops, I updated my reply after you responded, adding some feedback on the goal itself and points 1-3. I also included a reference about the annotation.

I'm not 100% sure what all cloud providers use to autoscale nodes, but I think they typically use that cluster autoscaler, perhaps as their own fork with smaller changes to it.

@yuvipanda
Member Author

yuvipanda commented Feb 17, 2021

I've now come to the conclusion that @consideRatio is right, and we shouldn't have any PDBs for hub / proxy. We have a separate core node pool that won't scale down to 0, so the PDB isn't needed for that. Its presence causes issues in node upgrades, replacements, etc., so it's not worth keeping.

Currently, I see we have PDBs for user placeholders & user schedulers. Do you think we should keep those, @consideRatio?

Thank you for patiently working with me on this :)
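
For reference, opting back out in chart values looks roughly like this, assuming the same hub.pdb / proxy.pdb keys as in the sketch above (which matches what jupyterhub/zero-to-jupyterhub-k8s#1938 made the default):

hub:
  pdb:
    enabled: false   # rely on the dedicated core node pool instead of a PDB

proxy:
  pdb:
    enabled: false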

@consideRatio
Member

The PDB for user-placeholder pods is there to give permission for those pods to be disrupted without restrictions; it is a good default to keep.

The PDB for the user-scheduler, which is HA, can make sense to keep as long as we have the default of 2 replicas set on the user-scheduler deployment, as it ensures there is always one running.

I want to make the following change to the z2jh user-scheduler PDB though, so that it will be fine to have the PDB enabled by default even if you lower the replicas to 1.

     pdb:
       enabled: true
-      minAvailable: 1
+      maxUnavailable: 1
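
Spelled out as values, the proposed default would look roughly like this; the scheduling.userScheduler path is an assumption about the chart's layout, but the point is the behavior of maxUnavailable: 1:

scheduling:
  userScheduler:
    replicas: 2          # with 2 replicas, maxUnavailable: 1 keeps at least one scheduler running
    pdb:
      enabled: true
      maxUnavailable: 1  # with replicas: 1, this still permits evicting the single pod,
                         # so the PDB no longer blocks node drains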

@yuvipanda
Member Author

Makes sense! Thanks for the explanation :)

I want to run just one scheduler per cluster, since they'll all be configured the same! That's unrelated to this though.

@consideRatio
Member

consideRatio commented Feb 17, 2021

Yeah, it may be a bit overkill to have the user-scheduler run with two replicas. They use leader election, so in practice only one does the work while the other stands by to take over if the first one fails, I think.

I created a PR in z2jh btw: jupyterhub/zero-to-jupyterhub-k8s#2039

@yuvipanda
Member Author

The new defaults for z2jh are the right thing to do now. Thanks for pushing this through, @consideRatio!

@consideRatio
Member

It's a pleasure working with you @yuvipanda, I'm very thankful for your communication skills!
