Dask Cluster Lifecycle Manager for Idle clusters #687
Does the
Thanks for posting that link. That looks like it might be what I'm looking for! Does it clean up the Custom Resources when it times out? I'll give it a shot. Thanks again!
I had a chance to try out that suggestion. Also, I noticed that when the cluster does shut down while a Python session is still connected, it returns a really confusing error message (as below). I was wondering if there is a way to make that fail more gracefully; I didn't realize it was a timeout error until I saw in the (user-inaccessible) logs that the cluster had been shut down.
I created a dummy project to help me test it here. This is what I used in this example.
This may be a duplicate of #255.
The k8s DaskCluster resource ends up in a `Stopped` phase, for example:

```yaml
apiVersion: gateway.dask.org/v1alpha1
kind: DaskCluster
# ...
status:
  completionTime: "2023-10-25T11:43:39Z"
  credentials: dask-credentials-b3a990d302d84720aae27404f6153ade
  ingressroute: dask-b3a990d302d84720aae27404f6153ade
  ingressroutetcp: dask-b3a990d302d84720aae27404f6153ade
  phase: Stopped
  schedulerPod: dask-scheduler-b3a990d302d84720aae27404f6153ade
  service: dask-b3a990d302d84720aae27404f6153ade
```

The question could then pivot to "should a stopped DaskCluster resource get cleaned up directly, or only after some time?". This is similar to a k8s Job resource creating a Pod to do some work: the Pod and Job are left in a "Completed" state for a while. There is an existing discussion topic about that.
CronJob, a k8s resource that creates Job resources, can clean up the Job resources it creates.
It appears that in k8s 1.23+ (by now probably used by most k8s clusters), there is a controller that reads a TTL field on the k8s Job resource and deletes the Job once that TTL has elapsed after completion.
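For comparison, the TTL-after-finished mechanism on a plain k8s Job looks like this. The `ttlSecondsAfterFinished` field is the standard upstream Kubernetes field, not anything dask-gateway specific; the Job itself is just a throwaway example:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  # Standard k8s field: the TTL-after-finished controller deletes this Job
  # (and its Pods) 300 seconds after it completes or fails.
  ttlSecondsAfterFinished: 300
  template:
    spec:
      containers:
        - name: work
          image: busybox
          command: ["true"]
      restartPolicy: Never
```

A similar TTL-style field on the DaskCluster resource could give administrators the "cleaned up after some time" behaviour discussed here.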
I opened #760 about the cleanup part; closing this issue as resolved by the option discussed above.
As an administrator of Dask Gateway on Kubernetes, I would like to have the option to have Dask Gateway automatically delete Dask clusters that have been idle for a configurable amount of time, so that users of the gateway client do not have to explicitly shut down their cluster and risk wasting money on idle resources.
I was wondering if there are any existing features in Dask Gateway that manage the lifecycle of existing clusters. When a user is finished using a Dask cluster, is there any means to clean up the Custom Resource so that clusters do not accumulate? For example, if a user creates a cluster but forgets to delete it when they are done, can it automatically be cleaned up after a period of time? I understand that the HPAs associated with each cluster do help with cost savings by scaling down the clusters, but there is still the potential for a lot of scheduler pods to hang around, consuming resources until they are manually removed.
I created a similar service for Spark clusters: a process that ran every few minutes and checked the master's /json endpoint for idle applications. It looks like Dask has an API on the scheduler that might serve as a place to look for idle apps as well (https://distributed.dask.org/en/stable/http_services.html). I might find it an interesting challenge to create something like that for Dask Gateway, if no such process exists. But I'm pretty new to Dask, so I'm not sure what the best architecture for that would be, or whether this feature already exists in some form.
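The decision logic for such a reaper is small; here is a hedged sketch of just the pure part. The names `is_expired` and `clusters_to_delete` and the TTL policy are my own illustration, not an existing dask-gateway API; an actual service would list DaskCluster custom resources via the Kubernetes API on a timer and delete the names this returns:

```python
from datetime import datetime, timedelta, timezone


def is_expired(completion_time: str, ttl_seconds: int, now: datetime) -> bool:
    """True if an RFC 3339 timestamp like "2023-10-25T11:43:39Z" is
    more than ttl_seconds in the past."""
    stopped = datetime.strptime(
        completion_time, "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return now - stopped > timedelta(seconds=ttl_seconds)


def clusters_to_delete(clusters: list, ttl_seconds: int, now: datetime) -> list:
    """Given DaskCluster resources as dicts (the status shape shown earlier
    in this thread), return names of Stopped clusters past the TTL."""
    doomed = []
    for cluster in clusters:
        status = cluster.get("status", {})
        if status.get("phase") != "Stopped":
            continue  # only reap clusters that have already stopped
        completed = status.get("completionTime")
        if completed and is_expired(completed, ttl_seconds, now):
            doomed.append(cluster["metadata"]["name"])
    return doomed
```

Extending this to reap *idle but running* clusters would additionally need some signal from the scheduler (e.g. via its HTTP services), which is the part I'm least sure how to architect.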