If a multi-node cluster is partially stopped (during autostop, or after manually stopping the worker node), i.e. the cluster is in the INIT state, our backend.teardown function fails to terminate the stopped nodes. This is because we use ray down to terminate the cluster, and ray ignores stopped nodes.
Related code path:
skypilot/sky/backends/cloud_vm_ray_backend.py, lines 3367 to 3369 in b962601
To reproduce:
sky launch -c min --cloud gcp --num-nodes 2 --cpus 2
Manually stop the worker node.
sky down min
After that, the head node is terminated, but the worker node remains in the STOPPED state.
One possible solution is to use only our own cloud CLI-based termination. We should be careful, though: we may need to kill the head node first so that the ray autoscaler does not restart some of the worker nodes and cause leakage.
Related to this, after #2087 is merged, we should be able to move the terminate commands to the cloud class as well, making the backend.teardown function cleaner.