Dask Cluster Lifecycle Manager for Idle clusters #687

Closed
JoeJasinski opened this issue Feb 20, 2023 · 6 comments
Comments

@JoeJasinski
Contributor

JoeJasinski commented Feb 20, 2023

As an administrator of Dask Gateway on Kubernetes, I would like to have the option to have Dask Gateway automatically delete Dask clusters that have been idle for a configurable amount of time, so that users of the gateway client do not have to explicitly shut down their cluster and risk wasting money on idle resources.

I was wondering if there are any existing features in Dask Gateway that manage the lifecycle of existing clusters. When a user is finished using a Dask cluster, is there any means to clean up the Custom Resource so that clusters do not accumulate? For example, if a user creates a cluster but forgets to delete it when they are done, can it automatically be cleaned up after a period of time? I understand that the HPAs associated with each cluster help with cost savings by scaling the clusters down, but there is still the potential for a lot of scheduler pods to hang around, consuming resources until they are manually removed.

I created a similar service for Spark clusters: a process ran every few minutes and checked the master's /json endpoint for idle applications. It looks like Dask exposes an API on the scheduler that might serve as a place to look for idle apps as well (https://distributed.dask.org/en/stable/http_services.html). I might find it an interesting challenge to build something like that for Dask Gateway if no such process exists, but I'm pretty new to Dask, so I'm not sure what the best architecture would be or whether this feature already exists in some form.
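
For illustration, here is a rough sketch of the kind of poller I have in mind. The scheduler route and response keys below are guesses on my part and would need to be confirmed against the HTTP services docs linked above; the cleanup step is only a placeholder.

import time

import requests

SCHEDULER_URL = "http://dask-scheduler:8787"  # guess: the scheduler's HTTP/dashboard port
POLL_INTERVAL_S = 300                         # check every five minutes
IDLE_LIMIT_S = 1800                           # tear down after 30 idle minutes

def looks_idle() -> bool:
    # Guess: poll a scheduler HTTP route for task activity; the real route and
    # key names must be taken from the distributed HTTP services documentation.
    counts = requests.get(f"{SCHEDULER_URL}/json/counts.json", timeout=10).json()
    return counts.get("tasks", 0) == 0

def shut_down_cluster() -> None:
    # Placeholder for whatever cleanup the gateway would perform,
    # e.g. deleting the cluster via the gateway API.
    print("cluster idle past limit; shutting it down")

def main() -> None:
    idle_since = None
    while True:
        if looks_idle():
            idle_since = idle_since or time.monotonic()
            if time.monotonic() - idle_since >= IDLE_LIMIT_S:
                shut_down_cluster()
                break
        else:
            idle_since = None
        time.sleep(POLL_INTERVAL_S)

if __name__ == "__main__":
    main()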

@TomAugspurger
Member

Does the idle_timeout configuration option do what you want?
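
Something along these lines in the gateway configuration (e.g. a dask_gateway_config.py) should do it; the 1800-second value is just an example:

# dask_gateway_config.py
# Shut down clusters automatically after 30 idle minutes (example value).
c.ClusterConfig.idle_timeout = 1800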

@JoeJasinski
Contributor Author

Thanks for posting that link. That looks like it might be what I'm looking for! Does it clean up the Custom Resources when it times out? I'll give it a shot. Thanks again!

@JoeJasinski
Contributor Author

JoeJasinski commented Feb 26, 2023

I had a chance to try out idle_timeout and it works well. One thing I noticed is that when the idle_timeout expires, the cluster gets deleted, but the "daskcluster" custom resource still exists. I imagine those could accumulate over time if a lot of clusters were spinning up and down. Is there a way to easily detect which ones aren't running and clean up the Custom Resources?

Also, I noticed that when the cluster shuts down while a Python session is still connected, it returns a really confusing error message, shown below. I was wondering if there is a way to make that fail more gracefully. I didn't realize this was a timeout until I saw in the (user-inaccessible) logs that the cluster had been shut down.

root@dask-client:/src# python
Python 3.9.16 (main, Feb 11 2023, 02:49:26) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask_gateway import Gateway
>>> 
>>> gateway = Gateway("http://traefik-dask-gateway:80")
>>> print(gateway.list_clusters())
[]
>>> cluster = gateway.new_cluster()
>>> client = cluster.get_client()
/usr/local/lib/python3.9/site-packages/distributed/client.py:1361: VersionMismatchWarning: Mismatched versions found

+-------------+----------------+----------------+---------+
| Package     | Client         | Scheduler      | Workers |
+-------------+----------------+----------------+---------+
| dask        | 2023.2.1       | 2022.12.1      | None    |
| distributed | 2023.2.1       | 2022.12.1      | None    |
| python      | 3.9.16.final.0 | 3.11.1.final.0 | None    |
+-------------+----------------+----------------+---------+
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
>>> 2023-02-26 05:45:35,766 - tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'tls://10.244.0.12:8786' processes=0 threads=0, memory=0 B>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1445, in _heartbeat
    self.scheduler_comm.send({"op": "heartbeat-client"})
AttributeError: 'NoneType' object has no attribute 'send'
2023-02-26 05:45:40,766 - tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'tls://10.244.0.12:8786' processes=0 threads=0, memory=0 B>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1445, in _heartbeat
    self.scheduler_comm.send({"op": "heartbeat-client"})
AttributeError: 'NoneType' object has no attribute 'send'
2023-02-26 05:45:45,767 - tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'tls://10.244.0.12:8786' processes=0 threads=0, memory=0 B>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1445, in _heartbeat
    self.scheduler_comm.send({"op": "heartbeat-client"})

I created a dummy project to help me test this; it is what I used for the example above:
https://github.com/JoeJasinski/dask-gateway-testing

@jacobtomlinson
Member

This may be a duplicate of #255

@consideRatio
Collaborator

> One thing I noticed is that when the idle_timeout expires, the cluster gets deleted, but the "daskcluster" custom resource still exists.

The k8s DaskCluster resource enters a "Stopped" state.

apiVersion: gateway.dask.org/v1alpha1
kind: DaskCluster
# ...
status:
  completionTime: "2023-10-25T11:43:39Z"
  credentials: dask-credentials-b3a990d302d84720aae27404f6153ade
  ingressroute: dask-b3a990d302d84720aae27404f6153ade
  ingressroutetcp: dask-b3a990d302d84720aae27404f6153ade
  phase: Stopped
  schedulerPod: dask-scheduler-b3a990d302d84720aae27404f6153ade
  service: dask-b3a990d302d84720aae27404f6153ade
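
So today, stopped DaskCluster resources can be found and removed manually with something like the following (a sketch only; it assumes the daskclusters resource name from the CRD and the Stopped phase shown above):

# list DaskCluster resources that have stopped
kubectl get daskclusters -A -o jsonpath='{.items[?(@.status.phase=="Stopped")].metadata.name}'

# delete one of them
kubectl delete daskcluster <name> -n <namespace>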

The question could then pivot to "should a stopped DaskCluster resource get cleaned up immediately, or only after some time?".

This is similar to a k8s Job resource creating a Pod to do some work: the Pod and the Job are then left in a "Completed" state for a while. The Kubernetes documentation has a section on this:

> When a Job completes, no more Pods are created, but the Pods are usually not deleted either. Keeping them around allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output. The job object also remains after it is completed so that you can view its status. It is up to the user to delete old jobs after noting their status.

CronJob, a k8s resource that creates Job resources, can clean up the Job resources it creates.

> Finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.

It appears that in k8s 1.23+ (now probably used by most k8s clusters), there is a controller that reads the k8s Job resource's ttlSecondsAfterFinished field. I think it could make sense for the dask-gateway resource controller to respect such configuration as well.
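
For reference, this is how that field looks on a plain k8s Job today; whether a DaskCluster resource should grow an analogous field is only the suggestion above, not something dask-gateway currently supports:

apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-example
spec:
  # The TTL-after-finished controller deletes the Job (and its Pods)
  # five minutes after it completes or fails.
  ttlSecondsAfterFinished: 300
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["echo", "done"]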

@consideRatio
Collaborator

I opened #760 about the cleanup part, closing this issue as resolved by the idle_timeout configuration.
