Dask Cluster Lifecycle Manager for Idle clusters #687

Closed
JoeJasinski opened this issue Feb 20, 2023 · 6 comments
Comments

@JoeJasinski
Contributor

JoeJasinski commented Feb 20, 2023

As an administrator of Dask Gateway on Kubernetes, I would like to have the option to have Dask Gateway automatically delete Dask clusters that have been idle for a configurable amount of time, so that users of the gateway client do not have to explicitly shut down their cluster and risk wasting money on idle resources.

I was wondering if there are any existing features in Dask Gateway that manage the lifecycle of existing clusters. When a user is finished using a Dask cluster, is there any means to clean up the Custom Resource so that clusters do not accumulate? For example, if a user creates a cluster but forgets to delete it when they are done, can it automatically be cleaned up after a period of time? I understand that the HPAs associated with each cluster help with cost savings by scaling the clusters down, but there is still the potential for a lot of scheduler pods to hang around, consuming resources until they are manually removed.

I created a similar service for Spark clusters: a process ran every few minutes and checked the master's /json endpoint for idle applications. It looks like Dask exposes an API on the scheduler that might serve as a place to look for idle apps as well (https://distributed.dask.org/en/stable/http_services.html). I might find it an interesting challenge to build something like that for Dask Gateway if no such process exists, but I'm pretty new to Dask, so I'm not sure what the best architecture would be or whether this feature already exists in some form.
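
For illustration, here is a rough sketch of the kind of poller I have in mind. The scheduler route and response keys below are guesses on my part and would need to be confirmed against the HTTP services docs linked above; the cleanup step is only a placeholder.

import time

import requests

SCHEDULER_URL = "http://dask-scheduler:8787"  # guess: the scheduler's HTTP/dashboard port
POLL_INTERVAL_S = 300                         # check every five minutes
IDLE_LIMIT_S = 1800                           # tear down after 30 idle minutes

def looks_idle() -> bool:
    # Guess: poll a scheduler HTTP route for task activity; the real route and
    # key names must be taken from the distributed HTTP services documentation.
    counts = requests.get(f"{SCHEDULER_URL}/json/counts.json", timeout=10).json()
    return counts.get("tasks", 0) == 0

def shut_down_cluster() -> None:
    # Placeholder for whatever cleanup the gateway would perform,
    # e.g. deleting the cluster via the gateway API.
    print("cluster idle past limit; shutting it down")

def main() -> None:
    idle_since = None
    while True:
        if looks_idle():
            idle_since = idle_since or time.monotonic()
            if time.monotonic() - idle_since >= IDLE_LIMIT_S:
                shut_down_cluster()
                break
        else:
            idle_since = None
        time.sleep(POLL_INTERVAL_S)

if __name__ == "__main__":
    main()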

@TomAugspurger
Member

Does the idle_timeout configuration option do what you want?
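
Something along these lines in the gateway configuration (e.g. a dask_gateway_config.py) should do it; the 1800-second value is just an example:

# dask_gateway_config.py
# Shut down clusters automatically after 30 idle minutes (example value).
c.ClusterConfig.idle_timeout = 1800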

@JoeJasinski
Contributor Author

Thanks for posting that link. That looks like it might be what I'm looking for! Does it clean up the Custom Resources when it times out? I'll give it a shot. Thanks again!

@JoeJasinski
Contributor Author

JoeJasinski commented Feb 26, 2023

I had a chance to try out idle_timeout and it works well. One thing I noticed is that when the idle_timeout expires, the cluster gets deleted, but the "daskcluster" custom resource still exists. I imagine those could accumulate over time if a lot of clusters were spinning up and down. Is there a way to easily detect which ones aren't running and clean up the Custom Resources?

Also, I noticed that when the cluster shuts down while a Python session is still connected, it returns a really confusing error message, shown below. I was wondering if there is a way to make that fail more gracefully. I didn't realize this was a timeout until I saw in the (user-inaccessible) logs that the cluster had been shut down.

root@dask-client:/src# python
Python 3.9.16 (main, Feb 11 2023, 02:49:26) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask_gateway import Gateway
>>> 
>>> gateway = Gateway("http://traefik-dask-gateway:80")
>>> print(gateway.list_clusters())
[]
>>> cluster = gateway.new_cluster()
>>> client = cluster.get_client()
/usr/local/lib/python3.9/site-packages/distributed/client.py:1361: VersionMismatchWarning: Mismatched versions found

+-------------+----------------+----------------+---------+
| Package     | Client         | Scheduler      | Workers |
+-------------+----------------+----------------+---------+
| dask        | 2023.2.1       | 2022.12.1      | None    |
| distributed | 2023.2.1       | 2022.12.1      | None    |
| python      | 3.9.16.final.0 | 3.11.1.final.0 | None    |
+-------------+----------------+----------------+---------+
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
>>> 2023-02-26 05:45:35,766 - tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'tls://10.244.0.12:8786' processes=0 threads=0, memory=0 B>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1445, in _heartbeat
    self.scheduler_comm.send({"op": "heartbeat-client"})
AttributeError: 'NoneType' object has no attribute 'send'
2023-02-26 05:45:40,766 - tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'tls://10.244.0.12:8786' processes=0 threads=0, memory=0 B>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1445, in _heartbeat
    self.scheduler_comm.send({"op": "heartbeat-client"})
AttributeError: 'NoneType' object has no attribute 'send'
2023-02-26 05:45:45,767 - tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'tls://10.244.0.12:8786' processes=0 threads=0, memory=0 B>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1445, in _heartbeat
    self.scheduler_comm.send({"op": "heartbeat-client"})

I created a dummy project to help me test this; it is what I used for the example above:
https://github.com/JoeJasinski/dask-gateway-testing

@jacobtomlinson
Member

This may be a duplicate of #255

@consideRatio
Collaborator

> One thing I noticed is that when the idle_timeout expires, the cluster gets deleted, but the "daskcluster" custom resource still exists.

The k8s DaskCluster resource enters a "Stopped" state.

apiVersion: gateway.dask.org/v1alpha1
kind: DaskCluster
# ...
status:
  completionTime: "2023-10-25T11:43:39Z"
  credentials: dask-credentials-b3a990d302d84720aae27404f6153ade
  ingressroute: dask-b3a990d302d84720aae27404f6153ade
  ingressroutetcp: dask-b3a990d302d84720aae27404f6153ade
  phase: Stopped
  schedulerPod: dask-scheduler-b3a990d302d84720aae27404f6153ade
  service: dask-b3a990d302d84720aae27404f6153ade
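
So today, stopped DaskCluster resources can be found and removed manually with something like the following (a sketch only; it assumes the daskclusters resource name from the CRD and the Stopped phase shown above):

# list DaskCluster resources that have stopped
kubectl get daskclusters -A -o jsonpath='{.items[?(@.status.phase=="Stopped")].metadata.name}'

# delete one of them
kubectl delete daskcluster <name> -n <namespace>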

The question could then pivot to "should a stopped DaskCluster resource get cleaned up immediately, or only after some time?".

This is similar to a k8s Job resource creating a Pod to do some work: the Pod and the Job are then left in a "Completed" state for a while. The Kubernetes documentation has a section on this:

> When a Job completes, no more Pods are created, but the Pods are usually not deleted either. Keeping them around allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output. The job object also remains after it is completed so that you can view its status. It is up to the user to delete old jobs after noting their status.

CronJob, a k8s resource that creates Job resources, can clean up the Job resources it creates.

> Finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.

It appears that in k8s 1.23+ (now probably used by most k8s clusters), there is a controller that reads the k8s Job resource's ttlSecondsAfterFinished field. I think it could make sense for the dask-gateway resource controller to respect such configuration as well.
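
For reference, this is how that field looks on a plain k8s Job today; whether a DaskCluster resource should grow an analogous field is only the suggestion above, not something dask-gateway currently supports:

apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-example
spec:
  # The TTL-after-finished controller deletes the Job (and its Pods)
  # five minutes after it completes or fails.
  ttlSecondsAfterFinished: 300
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["echo", "done"]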

@consideRatio
Collaborator

I opened #760 about the cleanup part, closing this issue as resolved by the idle_timeout configuration.
