Scheduling dask worker pods #273
@consideRatio - I think you have the current jupyter/dask scaling behavior mostly right. I think we should look into this further, but I'm not clear exactly how we would go about trying this. Do you have some specific ideas on how we could test it? We should feel free to try anything we want on https://hub.pangeo.io (https://github.com/pangeo-data/pangeo-cloud-federation/tree/staging/deployments/dev).
@jhamman the key thing to change is to make sure the worker pods get the `schedulerName` field set in their pod spec. I know too little about dask-kubernetes to be certain of the exact API, but something like this:

```python
from dask_kubernetes import KubeCluster, make_pod_spec

# the namespace could be dynamically fetched from an ENV var we set
helm_release_name = "jupyterhub"
pod_spec = make_pod_spec(
    image="daskdev/dask:latest",  # make_pod_spec requires an image
    extra_pod_config={
        "schedulerName": f"{helm_release_name}-user-scheduler",
    },
)
cluster = KubeCluster(pod_spec)

# trigger a cluster-autoscaler scale up to >= 2 worker nodes
cluster.scale(10)
# remove all pods to have two empty worker nodes
cluster.scale(0)
# add some pods again, and see if they schedule on the same node or spread out;
# with the configured scheduler these pods should:
# - schedule at all
# - schedule on the same node
cluster.scale(5)
```

A good verification step along the way is to inspect the pod spec of a worker and verify it has the `schedulerName` specified correctly, pointing at the z2jh-provided scheduler.
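That verification step can be scripted. The hypothetical helper below just checks the `schedulerName` field on a pod manifest dict (the same field `kubectl get pod <name> -o jsonpath='{.spec.schedulerName}'` would show); the helper name and the example manifest are made up for illustration:

```python
def uses_user_scheduler(pod_manifest: dict, helm_release_name: str) -> bool:
    """Check that a pod manifest asks for the z2jh user-scheduler."""
    scheduler = pod_manifest.get("spec", {}).get("schedulerName")
    return scheduler == f"{helm_release_name}-user-scheduler"

# a minimal pod manifest, shaped like what dask-kubernetes might produce
worker_pod = {
    "kind": "Pod",
    "spec": {"schedulerName": "jupyterhub-user-scheduler"},
}
print(uses_user_scheduler(worker_pod, "jupyterhub"))  # True
```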
Thanks @consideRatio - this makes sense. I'll give it a try. I've also just invited you to the pangeo github org, so you should have access to hub.pangeo.io where you could try this out yourself if you like. I can also get you set up on our GCP account if it would be useful to you going forward (just send me a dm with your google account id).
I tried this out and then realized we've turned off the user scheduler... so things didn't work (see `pangeo-cloud-federation/pangeo-deploy/values.yaml`, lines 39 to 43 at 95acbe0).
I think we need to turn this back on before proceeding. I think now that we have better handling of taints/affinities, we can turn this back on and things should work much better.
@TomAugspurger - since you were doing some work on dask-gateway worker pod optimizations, I wanted to revive this issue and bring it to your attention. @consideRatio's original example of how this is important is a good one for the jupyterhub case. Another situation where I've observed this coming up is when a single user runs a really big computation. Not sure how things stand currently with dask-gateway and whether the scheduling settings are consistent across hubs.
Thanks, I am indeed looking at this right now :) I'm hoping #807 is a start, just to ensure that we don't have cases where the requests are just over the allocatable CPU / memory because of kubernetes system pods. I'm also concerned about the adaptive case you raise. I worry that kubernetes prefers to spread new pods out, thus preventing a node from retiring. I'm seeing if https://kubernetes.io/docs/concepts/scheduling-eviction/resource-bin-packing/ can help.
The k8s default pod scheduler does this across all versions of k8s, while the z2jh helm chart's user-scheduler packs pods onto the busiest nodes, for pods that have declared they want to be scheduled by it. I'm not sure if it would work to ask to be scheduled by the z2jh scheduler if you are in another namespace.
This can only help assuming you deploy your own kube-scheduler binary, because only then can you configure it at the moment. In the future, it may be able to adapt based on just creating configmaps in kube-system or similar, but for now, you must deploy it yourself. This is what z2jh does.

The user-scheduler, which is a kube-scheduler configured for z2jh's need to pack user pods, is configured in two different ways depending on what k8s version z2jh is installed on. We currently use kube-scheduler from k8s 1.19.2 if you are on a k8s cluster 1.16+, and I don't want to think any more about the old, deprecated configuration option we used before. Here are the configuration options we rely on to influence the scoring of nodes where a pod can be scheduled; the rule called "NodeResourcesMostAllocated" is the one of relevance to help us pack pods.

I would strongly recommend trying to make use of the user-scheduler's logic rather than deploying your own, because what you want is the same behavior, I think. Perhaps you need to make it able to schedule pods in different namespaces if it can't do that already; I have not tried and am not sure about it. To reason about whether it can schedule pods from other namespaces, I inspect the permissions we grant it here. I conclude that we grant the user-scheduler a ClusterRole only, and not a Role, so I think it can very well end up being able to schedule other pods. On this line you can see that we let the user-scheduler's kube-scheduler binary be configured with a name; it is this name that you could reference in a pod's `schedulerName`.

Does dask-gateway have extraPodConfig? It seems so here. I think what you need to do is to configure your dask-gateway helm chart like this.

**Solution suggestion**

Configure dask-gateway to let its worker pods be scheduled by the z2jh user-scheduler:

```yaml
gateway:
  backend:
    worker:
      extraPodConfig:
        schedulerName: <your-helm-release-name>-user-scheduler
```
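For intuition about why a "NodeResourcesMostAllocated" rule packs pods: it scores each candidate node by how much of its allocatable resources are already requested, so busier nodes win. The sketch below is illustrative only; the real kube-scheduler applies per-resource weights and more normalization, and the node numbers are made up:

```python
def most_allocated_score(requested: dict, allocatable: dict) -> float:
    """Score 0-100: higher for nodes whose resources are more requested."""
    ratios = [min(requested[r] / allocatable[r], 1.0) for r in allocatable]
    return 100 * sum(ratios) / len(ratios)

# a busy node beats an empty one, so new pods pack onto the busy node
busy = most_allocated_score({"cpu": 7, "mem": 24}, {"cpu": 8, "mem": 32})
empty = most_allocated_score({"cpu": 0, "mem": 0}, {"cpu": 8, "mem": 32})
print(busy, empty)  # 81.25 0.0
```

The default scheduler's spreading behavior is essentially the inverse score (least allocated wins), which is why idle nodes keep attracting new worker pods.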
Thanks for the explanation, Erik! I will try that out. I think I will also add that config to the dask-gateway scheduler nodes. This does carry a risk that we oversubscribe the node pool with the scheduler, since it will be placed on busy machines, but I think this would happen regardless as new worker pods show up anyway. If this is a problem in practice, then we can create a new node pool dedicated to schedulers (on GCP we're using GCE's nodepool auto-provisioner anyway).
@TomAugspurger 🎉 =)
It sounds good to let the dask-scheduler pod created for each user's dask cluster either be scheduled by the user-scheduler as well, or be scheduled on a dedicated node that doesn't risk being shut down if running with cheap preemptible/spot priced nodes. I'm not sure how much network traffic will flow between pods etc., but I know that bigger nodes typically get better network traffic capacity, so if schedulers are low-CPU but high-network, they may be network-limited on a low-CPU node.
Thanks for all the information and improvements @TomAugspurger and @consideRatio!
Just a note that we have been falling behind a bit on AWS with k8s version upgrades. BinderHub is running on 1.15 and JupyterHub on 1.16... 1.18 is now available on EKS. New versions are released roughly every 3 months. As far as I know, once 1.19 is released our 1.15 cluster will be automatically upgraded due to the supported-versions policy. I'm tempted to leave things and see how that automatic upgrade pans out!
I'm quite new in the domain of pangeo and Dask, but my understanding is that the users of a pangeo jupyter pod have the rights to use dask-kubernetes to spawn dask-worker pods. I created this issue to discuss the scheduling of these pods in order to downscale properly.
It is my guess that when these pods are scheduled, they will use the default scheduler. The default scheduler will spread the pods across the available nodes, but you probably want to pack them tight, as that allows nodes to free up and scale down.
So, I'm proposing to consider using the Z2JH helm chart's provided scheduler instead.
Example of when packing pods would be important
Say that you have 100 users and they each schedule 10 pods, and this leads to, for example, 20 powerful nodes (dedicated to worker pods) being created. These 100 users now all leave at the same time and suddenly there are no worker pods scheduled on those 20 powerful nodes. But then two users drop in, and they each schedule 10 worker pods, which are scheduled by the default scheduler, which spreads them apart across the 20 nodes. Suddenly all 20 nodes are utilized and may therefore not scale down...
If the z2jh user pod scheduler had been used instead, it would have packed the pods tightly onto one node, letting the other 19 nodes scale down and keeping only one running for the two users.
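The scenario can be sketched with a toy scheduler; the per-node `capacity` is a made-up number and the placement rules are deliberately simplistic, but the effect matches the example above:

```python
def schedule(n_pods: int, n_nodes: int, capacity: int, policy: str) -> int:
    """Toy scheduler: place pods on nodes, return how many nodes end up used."""
    load = [0] * n_nodes
    for _ in range(n_pods):
        if policy == "spread":
            node = load.index(min(load))             # least-loaded node first
        else:                                        # "pack"
            fits = [i for i in range(n_nodes) if load[i] < capacity]
            node = max(fits, key=lambda i: load[i])  # busiest node that still fits
        load[node] += 1
    return sum(1 for n in load if n > 0)

# two returning users x 10 pods each, 20 idle nodes that can each hold 20 pods
print(schedule(20, 20, 20, "spread"))  # 20 -> every node kept busy
print(schedule(20, 20, 20, "pack"))    # 1  -> 19 nodes free to scale down
```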
Concerns with this idea
It may require the z2jh-provided scheduler to schedule pods in different namespaces; is it configured for this? I don't remember. But we could probably make this configurable in the z2jh helm chart if not.
I tried evaluating this right away, but I'm still not confident about the situation. I now know that the z2jh user pod scheduler has the rights to list and schedule pods in all namespaces, but does its configuration make it able to do so? I think so, but I'm not sure.
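One way to reason about the permissions side: a ClusterRole bound via a ClusterRoleBinding applies in every namespace, while a RoleBinding (even one referencing a ClusterRole) is scoped to a single namespace. The tiny check below encodes just that rule for a binding manifest; the manifest shown is made up, and in a real cluster `kubectl auth can-i` against the user-scheduler's service account would be the authoritative check:

```python
def binding_is_cluster_wide(binding: dict) -> bool:
    """A ClusterRoleBinding grants its role in all namespaces;
    a RoleBinding grants it only in its own namespace."""
    return binding.get("kind") == "ClusterRoleBinding"

# a made-up binding, shaped like what z2jh grants the user-scheduler
user_scheduler_binding = {
    "kind": "ClusterRoleBinding",
    "roleRef": {"kind": "ClusterRole", "name": "jupyterhub-user-scheduler"},
}
print(binding_is_cluster_wide(user_scheduler_binding))  # True
```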
Related to the namespace concern