[Core] Resource leakage for sky down if a multi-node cluster is partially stopped #2089

Michaelvll · 2023-06-16T02:43:49Z

If a multi-node cluster is partially stopped (e.g. by autostop, or by manually stopping a worker node), i.e. the cluster is in INIT state, our backend.teardown function fails to terminate the stopped nodes, because we use ray down to terminate the cluster and ray ignores stopped nodes.
Related code path:

skypilot/sky/backends/cloud_vm_ray_backend.py

Lines 3367 to 3369 in b962601

    elif (terminate and
          (prev_cluster_status == global_user_state.ClusterStatus.STOPPED or
           use_tpu_vm)):

To reproduce:

sky launch -c min --cloud gcp --num-nodes 2 --cpus 2
Manually stop the worker node.
sky down min

After that, the head node will be terminated, but the worker node is still in STOPPED state.

One possible solution is to use only our own cloud CLI-based termination. Care is needed: we may need to kill the head node first, so that the ray autoscaler does not restart some of the worker nodes and cause leakage.


Michaelvll · 2023-06-16T02:50:46Z

Related to this, after #2087 is merged, we should be able to move the terminate commands to the cloud class as well, making the backend.teardown function cleaner.

HysunHe · 2023-06-16T06:39:18Z

@Michaelvll Do we need to include the INIT status here?

    elif (terminate and
          (prev_cluster_status in (global_user_state.ClusterStatus.STOPPED,
                                   global_user_state.ClusterStatus.INIT) or
           use_tpu_vm)):
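For discussion, the proposed condition can be written as a small standalone predicate. This is a sketch only: the `ClusterStatus` enum below is a stand-in for global_user_state.ClusterStatus with assumed values, and `needs_direct_termination` is a hypothetical helper name. Whether this check alone is sufficient is not established in this thread.

```python
import enum


class ClusterStatus(enum.Enum):
    """Stand-in for global_user_state.ClusterStatus (values are assumed)."""
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


def needs_direct_termination(terminate: bool,
                             prev_cluster_status: ClusterStatus,
                             use_tpu_vm: bool) -> bool:
    """Return True when teardown should bypass ray down.

    Treats INIT like STOPPED, since an INIT cluster may contain stopped
    nodes that ray down would ignore.
    """
    stopped_like = prev_cluster_status in (ClusterStatus.STOPPED,
                                           ClusterStatus.INIT)
    return bool(terminate and (stopped_like or use_tpu_vm))
```

With this predicate, a partially stopped cluster (INIT) being torn down would take the same branch as a fully stopped one, while a healthy UP cluster without a TPU VM would not.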

Michaelvll added the bug and P0 labels Jun 16, 2023

Michaelvll mentioned this issue Jun 23, 2023

GCP provisioning: handle 'wait_ready timeout exceeded'. #2124

Merged

5 tasks

Michaelvll mentioned this issue Jul 9, 2023

[GCP] Adopt new provisioner to stop/down clusters #2199

Merged

10 tasks

Michaelvll closed this as completed in #2199 Jul 15, 2023
