
Scheduling dask worker pods #273

Closed
consideRatio opened this issue May 9, 2019 · 10 comments · Fixed by #846

Comments

@consideRatio
Member

consideRatio commented May 9, 2019

I'm quite new to the domain of pangeo and Dask, but my understanding is that users of a pangeo jupyter pod have the rights to use dask-kubernetes to spawn dask-worker pods. I created this issue to discuss the scheduling of these pods so that nodes can scale down properly.

My guess is that when these pods are scheduled, they use the default scheduler, which spreads the pods across the available nodes. But you probably want to pack them tight, as that allows nodes to free up and scale down.

So, I'm proposing to consider using the Z2JH helm chart's provided scheduler instead.

Example of when packing pods would be important

Say you have 100 users who each schedule 10 pods, and this leads to, for example, 20 powerful nodes (dedicated to worker pods) being created. These 100 users then all leave at the same time, and suddenly there are no worker pods scheduled on these 20 powerful nodes. But then two users drop in and each schedule 10 worker pods, which the default scheduler spreads across the 20 nodes. Suddenly all 20 nodes are utilized and may therefore not scale down...

If the z2jh user pod scheduler had been used instead, it would have packed the pods tightly on one node, letting the other 19 nodes scale down and keeping just one running for the two users.

Concerns about this idea

It may require the z2jh-provided scheduler to schedule pods in different namespaces. Is it configured for this? I don't remember, but we could probably make this configurable in the z2jh helm chart if not.

I tried evaluating this right away, but I'm still not confident about the situation. I now know that the z2jh user pod scheduler has rights to list and schedule pods in all namespaces, but does its configuration make it able to do so? I think so, but I'm not sure.

Related to the namespace concern

@jhamman
Member

jhamman commented May 9, 2019

@consideRatio - I think you have the current jupyter/dask scaling behavior mostly right. I think we should look into this further but I'm not clear exactly how we would go about trying this. Do you have some specific ideas on how we could test this? We should feel free to try anything we want on https://hub.pangeo.io (https://github.com/pangeo-data/pangeo-cloud-federation/tree/staging/deployments/dev).

@consideRatio
Member Author

@jhamman the key thing to change is to make sure the worker pods get their schedulerName set to the z2jh-provided scheduler named {{ .Release.Name }}-user-scheduler. Then we can see if they schedule as they should. I bet we can manually specify this from dask-kubernetes somehow, but that would be a bit messy, and I would want it to be specified by default instead.

  1. We can test this without modifying the helm chart, by manually asking dask-kubernetes to adjust the pod spec to use a certain scheduler.
  2. Can we influence dask-kubernetes' default worker spec to include schedulerName: {{ .Release.Name }}-user-scheduler? Is there a file we can place in a certain directory, other than the user's home directory, that would influence it?
  3. It is technically possible, but practically not suitable, to add a kubernetes mutating admission webhook that modifies all pod specs before they are scheduled; we could then look for worker pods and modify them. I really don't recommend this, as it may be impossible to accomplish from a Helm chart.

I still know too little about dask-kubernetes... Looking at the examples in the docs, I think what we want to do is...

from dask_kubernetes import KubeCluster, make_pod_spec

# the release name could be dynamically fetched from an env var we set
helm_release_name = "jupyterhub"
pod_spec = make_pod_spec(
    # make_pod_spec requires a worker image; use whatever image the hub provides
    image="daskdev/dask:latest",
    extra_pod_config={
        "schedulerName": f"{helm_release_name}-user-scheduler",
    },
)
cluster = KubeCluster(pod_spec)

# trigger a cluster-autoscaler scale up to >=2 worker nodes
cluster.scale(10)

# remove all pods to leave two empty worker nodes
cluster.scale(0)

# add some pods again, and see if they schedule on the same node or spread out
# with the configured scheduler these pods should:
# --- schedule at all
# --- schedule on the same node
cluster.scale(5)

A good verification step along the way is to inspect the worker's pod spec and verify that its schedulerName is set to the z2jh-provided scheduler.
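For that inspection, something like the following could work (a rough sketch with the kubernetes Python client; the "jupyterhub" namespace and the app=dask label selector are assumptions that need adjusting to the actual deployment):

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run from a pod in the cluster
v1 = client.CoreV1Api()

# List the dask worker pods and print which scheduler picked each one up and
# on which node it landed; the namespace and label selector are placeholders.
pods = v1.list_namespaced_pod("jupyterhub", label_selector="app=dask")
for pod in pods.items:
    print(pod.metadata.name, pod.spec.scheduler_name, pod.spec.node_name)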

@jhamman
Member

jhamman commented May 10, 2019

Thanks @consideRatio - this makes sense. I'll give it a try. I've also just invited you to the pangeo github org, so you should have access to hub.pangeo.io where you could try this out yourself if you like. I can also get you set up on our GCP account if it would be useful to you going forward (just send me a DM with your Google account ID).

@jhamman
Member

jhamman commented May 16, 2019

I tried this out and then realized we've turned off the user scheduler...so things didn't work:

scheduling:
  userScheduler:
    enabled: false
  userPlaceholder:
    enabled: false

I think we need to turn this back on before proceeding. Now that we have better handling of taints/affinities, things should work much better.

@scottyhq
Member

scottyhq commented Nov 4, 2020

@TomAugspurger - since you were doing some work on dask-gateway worker pod optimizations, I wanted to revive this issue and bring it to your attention. @consideRatio's original example of how this is important is a good one for the jupyterhub case.

Another situation where I've observed this coming up is if a single user runs a really big computation using cluster.adapt() and then trims down the number of worker pods to explore results. You can easily end up with 10 pods spread across 10 nodes instead of 10 pods packed onto one node. I opened an issue about this on dask-kubernetes a while back: dask/dask-kubernetes#233
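To make the scenario concrete, here is a minimal sketch of that adaptive pattern with the classic dask-kubernetes KubeCluster (the image and the bounds are only illustrative):

from dask_kubernetes import KubeCluster, make_pod_spec

pod_spec = make_pod_spec(image="daskdev/dask:latest")
cluster = KubeCluster(pod_spec)

# scale out adaptively for the big computation, then let it shrink again
cluster.adapt(minimum=0, maximum=100)

# once the computation finishes, adapt() removes most workers, but the
# survivors may be the ones the default scheduler spread one-per-node,
# which keeps every node alive and blocks scale-down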

Not sure how things stand currently with dask-gateway and whether the scheduling settings are consistent across hubs.

@TomAugspurger
Member

Thanks, I am indeed looking at this right now :)

I'm hoping #807 is a start, just to ensure that we don't have cases where the requests are just over the allocatable CPU / memory because of kubernetes system pods.

I'm also concerned about the adaptive case you raise. I worry that kubernetes prefers to spread new pods out, thus preventing a node from retiring. I'm seeing if https://kubernetes.io/docs/concepts/scheduling-eviction/resource-bin-packing/ can help.

@consideRatio
Member Author

I worry that kubernetes prefers to spread new pods out, thus preventing a node from retiring.

The k8s default pod scheduler does this across all versions of k8s, while the z2jh helm chart's user-scheduler packs the pods that have declared they want to be scheduled by it onto the busiest nodes. I'm not sure whether it would work to ask to be scheduled by the z2jh scheduler from another namespace.

I'm seeing if https://kubernetes.io/docs/concepts/scheduling-eviction/resource-bin-packing/ can help.

This can only help if you deploy your own kube-scheduler binary, because only then can you configure it at the moment. In the future, it may be possible to configure it by just creating ConfigMaps in kube-system or similar, but for now you must deploy it yourself. This is what z2jh does.

The user-scheduler, which is a kube-scheduler configured for z2jh's need to pack user pods, is configured in two different ways depending on which k8s version z2jh is installed on. We currently use the kube-scheduler from k8s 1.19.2 if you are on a k8s cluster 1.16+; I don't want to think any more about the old, now-deprecated configuration option we used before.

Here are the configuration options we rely on to influence the scoring of nodes where a pod can be scheduled; the rule called "NodeResourcesMostAllocated" is the one relevant to packing pods.

I would strongly recommend trying to make use of the user-scheduler's logic rather than deploying your own, because I think what you want is the same behavior. Perhaps you need to make it able to schedule pods in different namespaces if it can't do that already; I have not tried and am not sure about it.

To reason about whether it can schedule pods from other namespaces, I inspect the permissions we grant it here. I conclude that we grant the user-scheduler a ClusterRole only, not a namespaced Role, so I think it may very well end up being able to schedule pods in other namespaces too.
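One quick way to double-check that from outside the chart is a sketch like this with the kubernetes Python client; it just matches RBAC bindings whose name contains "user-scheduler", which relies on z2jh's <release>-user-scheduler naming convention:

from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

# ClusterRoleBindings are cluster-wide; if the user-scheduler's service account
# shows up here (rather than only in namespaced RoleBindings), it should be
# allowed to act on pods in other namespaces too.
for crb in rbac.list_cluster_role_binding().items:
    if "user-scheduler" in crb.metadata.name:
        subjects = [s.name for s in (crb.subjects or [])]
        print(crb.metadata.name, "->", crb.role_ref.name, "subjects:", subjects)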

On this line you can see that we let the user-scheduler's kube-scheduler binary be configured with a name; it is this name that you would reference in a pod's spec.schedulerName, e.g. myz2jhreleasename-user-scheduler.

Does dask-gateway have extraPodConfig? It seems so here. I think what you need to do is configure your dask-gateway helm chart like this.

Solution suggestion

Configure dask-gateway to let its worker pods schedule with the z2jh user-scheduler.

gateway:
  backend:
    worker:
      extraPodConfig:
        schedulerName: <your-helm-release-name>-user-scheduler

@TomAugspurger
Member

Thanks for the explanation Erik! I will try that out.

I think I will also add that config to the dask-gateway scheduler nodes. This does carry a risk that we oversubscribe the node-pool with the scheduler, since it will be placed on busy machines, but I think this would happen regardless as new worker pods show up anyway. If this is a problem in practice, then we can create a new node pool dedicated to schedulers (on GCP we're using GCE's nodepool auto-provisioner anyway).

@consideRatio
Member Author

@TomAugspurger 🎉 =)

I think I will also add that config to the dask-gateway scheduler nodes.

It sounds good to let the dask-scheduler pod created for each user's dask cluster either be scheduled by the user-scheduler as well, or be scheduled on a dedicated node that doesn't risk being shut down, if you are running with cheap preemptible/spot-priced nodes.

I'm not sure how much network traffic will flow between pods etc., but I know that bigger nodes typically get more network capacity, so if schedulers are low-CPU but high-network, they may be network-limited on a low-CPU node.

@scottyhq
Member

scottyhq commented Nov 5, 2020

Thanks for all the information and improvements @TomAugspurger and @consideRatio!

We use kube-scheduler from k8s 1.19.2 currently if you are on a k8s cluster 1.16+,

Just a note that we have been falling behind a bit on AWS with K8s version upgrades. BinderHub is running on 1.15 and jupyterhub on 1.16... 1.18 is now available on EKS. These new versions are released roughly every 3 months. As far as I know, once 1.19 is released, our 1.15 cluster will be automatically upgraded due to the supported-version policy. I'm tempted to leave things and see how that automatic upgrade pans out!
