
Scheduling dask worker pods #273

Closed
consideRatio opened this issue May 9, 2019 · 10 comments · Fixed by #846

Comments

@consideRatio
Member

consideRatio commented May 9, 2019

I'm quite new to the domain of pangeo and Dask, but my understanding is that users of a pangeo jupyter pod have the rights to use dask-kubernetes to spawn dask-worker pods. I created this issue to discuss the scheduling of these pods so that nodes can scale down properly.

My guess is that when these pods are scheduled, they use the default scheduler, which spreads the pods across the available nodes. But you probably want to pack them tight, as that allows nodes to free up and scale down.

So, I'm proposing to consider using the Z2JH helm chart's provided scheduler instead.

Example of when packing pods would be important

Say you have 100 users who each schedule 10 pods, and this leads to, for example, 20 powerful nodes (dedicated to worker pods) being created. These 100 users then all leave at the same time, and suddenly there are no worker pods scheduled on these 20 powerful nodes. But then two users drop in and each schedule 10 worker pods, which the default scheduler spreads across the 20 nodes. Suddenly all 20 nodes are utilized and may therefore not scale down...

If the z2jh user pod scheduler had been used instead, it would have packed the pods tightly on one node, letting the other 19 nodes scale down and keeping just one running for the two users.

Concerns about this idea

It may require the z2jh-provided scheduler to schedule pods in different namespaces. Is it configured for this? I don't remember, but we could probably make this configurable in the z2jh helm chart if not.

I tried evaluating this right away, but I'm still not confident about the situation. I now know that the z2jh user pod scheduler has rights to list and schedule pods in all namespaces, but does its configuration make it able to do so? I think so, but I'm not sure.

Related to the namespace concern

@jhamman
Member

jhamman commented May 9, 2019

@consideRatio - I think you have the current jupyter/dask scaling behavior mostly right. I think we should look into this further but I'm not clear exactly how we would go about trying this. Do you have some specific ideas on how we could test this? We should feel free to try anything we want on https://hub.pangeo.io (https://github.com/pangeo-data/pangeo-cloud-federation/tree/staging/deployments/dev).

@consideRatio
Member Author

@jhamman the key thing to change is to make sure the worker pods get their schedulerName set to the z2jh-provided scheduler named {{ .Release.Name }}-user-scheduler. Then we can see if they schedule as they should. I bet we can manually specify this from dask-kubernetes somehow, but that would be a bit messy, and I would want it to be specified by default instead.

  1. We can test this without modifying the helm chart, by manually asking dask-kubernetes to adjust the pod spec to use a certain scheduler.
  2. Can we influence dask-kubernetes' default worker spec to include schedulerName: {{ .Release.Name }}-user-scheduler? Is there a file we can place in a certain directory, other than the user's home directory, that would influence it?
  3. It is technically possible, but practically not suitable, to add a kubernetes mutating admission webhook that modifies all pod specs before they are scheduled; we could then look for worker pods and modify them. I really don't recommend this, as it may be impossible to accomplish from a Helm chart.

I still know too little about dask-kubernetes... Looking at the examples in the docs, I think what we want to do is...

from dask_kubernetes import KubeCluster, make_pod_spec

# the release name could be dynamically fetched from an env var we set
helm_release_name = "jupyterhub"
pod_spec = make_pod_spec(
    # make_pod_spec requires a worker image; use whatever image the hub provides
    image="daskdev/dask:latest",
    extra_pod_config={
        "schedulerName": f"{helm_release_name}-user-scheduler",
    },
)
cluster = KubeCluster(pod_spec)

# trigger a cluster-autoscaler scale up to >=2 worker nodes
cluster.scale(10)

# remove all pods to leave two empty worker nodes
cluster.scale(0)

# add some pods again, and see if they schedule on the same node or spread out
# with the configured scheduler these pods should:
# --- schedule at all
# --- schedule on the same node
cluster.scale(5)

A good verification step along the way is to inspect the worker's pod spec and verify that its schedulerName is set to the z2jh-provided scheduler.
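For that inspection, something like the following could work (a rough sketch with the kubernetes Python client; the "jupyterhub" namespace and the app=dask label selector are assumptions that need adjusting to the actual deployment):

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run from a pod in the cluster
v1 = client.CoreV1Api()

# List the dask worker pods and print which scheduler picked each one up and
# on which node it landed; the namespace and label selector are placeholders.
pods = v1.list_namespaced_pod("jupyterhub", label_selector="app=dask")
for pod in pods.items:
    print(pod.metadata.name, pod.spec.scheduler_name, pod.spec.node_name)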

@jhamman
Member

jhamman commented May 10, 2019

Thanks @consideRatio - this makes sense. I'll give it a try. I've also just invited you to the pangeo github org, so you should have access to hub.pangeo.io where you could try this out yourself if you like. I can also get you set up on our GCP account if it would be useful to you going forward (just send me a DM with your Google account ID).

@jhamman
Member

jhamman commented May 16, 2019

I tried this out and then realized we've turned off the user scheduler...so things didn't work:

scheduling:
  userScheduler:
    enabled: false
  userPlaceholder:
    enabled: false

I think we need to turn this back on before proceeding. Now that we have better handling of taints/affinities, things should work much better.

@scottyhq
Member

scottyhq commented Nov 4, 2020

@TomAugspurger - since you were doing some work on dask-gateway worker pod optimizations, I wanted to revive this issue and bring it to your attention. @consideRatio's original example of how this is important is a good one for the jupyterhub case.

Another situation where I've observed this coming up is if a single user runs a really big computation using cluster.adapt() and then trims down the number of worker pods to explore results. You can easily end up with 10 pods spread across 10 nodes instead of 10 pods packed onto one node. I opened an issue about this on dask-kubernetes a while back: dask/dask-kubernetes#233
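To make the scenario concrete, here is a minimal sketch of that adaptive pattern with the classic dask-kubernetes KubeCluster (the image and the bounds are only illustrative):

from dask_kubernetes import KubeCluster, make_pod_spec

pod_spec = make_pod_spec(image="daskdev/dask:latest")
cluster = KubeCluster(pod_spec)

# scale out adaptively for the big computation, then let it shrink again
cluster.adapt(minimum=0, maximum=100)

# once the computation finishes, adapt() removes most workers, but the
# survivors may be the ones the default scheduler spread one-per-node,
# which keeps every node alive and blocks scale-down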

Not sure how things stand currently with dask-gateway and whether the scheduling settings are consistent across hubs.

@TomAugspurger
Member

Thanks, I am indeed looking at this right now :)

I'm hoping #807 is a start, just to ensure that we don't have cases where the requests are just over the allocatable CPU / memory because of kubernetes system pods.

I'm also concerned about the adaptive case you raise. I worry that kubernetes prefers to spread new pods out, thus preventing a node from retiring. I'm seeing if https://kubernetes.io/docs/concepts/scheduling-eviction/resource-bin-packing/ can help.

@consideRatio
Member Author

I worry that kubernetes prefers to spread new pods out, thus preventing a node from retiring.

The k8s default pod scheduler does this across all versions of k8s, while the z2jh helm chart's user-scheduler packs the pods that have declared they want to be scheduled by it onto the busiest nodes. I'm not sure whether it would work to ask to be scheduled by the z2jh scheduler from another namespace.

I'm seeing if https://kubernetes.io/docs/concepts/scheduling-eviction/resource-bin-packing/ can help.

This can only help if you deploy your own kube-scheduler binary, because only then can you configure it at the moment. In the future, it may be possible to configure it by just creating ConfigMaps in kube-system or similar, but for now you must deploy it yourself. This is what z2jh does.

The user-scheduler, which is a kube-scheduler configured for z2jh's need to pack user pods, is configured in two different ways depending on which k8s version z2jh is installed on. We currently use the kube-scheduler from k8s 1.19.2 if you are on a k8s cluster 1.16+; I don't want to think any more about the old, now-deprecated configuration option we used before.

Here are the configuration options we rely on to influence the scoring of nodes where a pod can be scheduled; the rule called "NodeResourcesMostAllocated" is the one relevant to packing pods.

I would strongly recommend trying to make use of the user-scheduler's logic rather than deploying your own, because I think what you want is the same behavior. Perhaps you need to make it able to schedule pods in different namespaces if it can't do that already; I have not tried and am not sure about it.

To reason about whether it can schedule pods from other namespaces, I inspect the permissions we grant it here. I conclude that we grant the user-scheduler a ClusterRole only, not a namespaced Role, so I think it may very well end up being able to schedule pods in other namespaces too.
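One quick way to double-check that from outside the chart is a sketch like this with the kubernetes Python client; it just matches RBAC bindings whose name contains "user-scheduler", which relies on z2jh's <release>-user-scheduler naming convention:

from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

# ClusterRoleBindings are cluster-wide; if the user-scheduler's service account
# shows up here (rather than only in namespaced RoleBindings), it should be
# allowed to act on pods in other namespaces too.
for crb in rbac.list_cluster_role_binding().items:
    if "user-scheduler" in crb.metadata.name:
        subjects = [s.name for s in (crb.subjects or [])]
        print(crb.metadata.name, "->", crb.role_ref.name, "subjects:", subjects)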

On this line you can see that we let the user-scheduler's kube-scheduler binary be configured with a name; it is this name that you would reference in a pod's spec.schedulerName, e.g. myz2jhreleasename-user-scheduler.

Does dask-gateway have extraPodConfig? It seems so here. I think what you need to do is configure your dask-gateway helm chart like this.

Solution suggestion

Configure dask-gateway to let its worker pods schedule with the z2jh user-scheduler.

gateway:
  backend:
    worker:
      extraPodConfig:
        schedulerName: <your-helm-release-name>-user-scheduler

@TomAugspurger
Member

Thanks for the explanation Erik! I will try that out.

I think I will also add that config to the dask-gateway scheduler nodes. This does carry a risk that we oversubscribe the node-pool with the scheduler, since it will be placed on busy machines, but I think this would happen regardless as new worker pods show up anyway. If this is a problem in practice, then we can create a new node pool dedicated to schedulers (on GCP we're using GCE's nodepool auto-provisioner anyway).

@consideRatio
Member Author

@TomAugspurger 🎉 =)

I think I will also add that config to the dask-gateway scheduler nodes.

It sounds good to let the dask-scheduler pod created for each user's dask cluster either be scheduled by the user-scheduler as well, or be scheduled on a dedicated node that doesn't risk being shut down, if you are running with cheap preemptible/spot-priced nodes.

I'm not sure how much network traffic will flow between pods etc., but I know that bigger nodes typically get more network capacity, so if schedulers are low-CPU but high-network, they may be network-limited on a low-CPU node.

@scottyhq
Member

scottyhq commented Nov 5, 2020

Thanks for all the information and improvements @TomAugspurger and @consideRatio!

We use kube-scheduler from k8s 1.19.2 currently if you are on a k8s cluster 1.16+,

Just a note that we have been falling behind a bit on AWS with K8s version upgrades. BinderHub is running on 1.15 and jupyterhub on 1.16... 1.18 is now available on EKS. These new versions are released roughly every 3 months. As far as I know, once 1.19 is released, our 1.15 cluster will be automatically upgraded due to the supported-version policy. I'm tempted to leave things and see how that automatic upgrade pans out!
