[GCP] Adopt new provisioner to stop/down clusters #2199
@@ -71,7 +71,7 @@
 _NODES_LAUNCHING_PROGRESS_TIMEOUT = {
     clouds.AWS: 90,
     clouds.Azure: 90,
-    clouds.GCP: 120,
+    clouds.GCP: 240,
Why does this need to change?
This is a known issue: 120 seconds is not enough to launch multiple nodes on GCP. We need to increase the timeout so that sky launch --cloud gcp --num-nodes 8 works.
…t into gcp-new-termination
LGTM
Thanks for reviewing this @suquark! I am now testing it with:
This PR adopts the new provisioner refactoring from #2121.
Fixes #2089, so that nodes are not leaked when the cluster is partially down.
Also fixes #1983, as we now send all the requests first and then wait for them to complete together (using multiple threads as in ray-project/ray#34455 is buggy; refer to #2160 (comment)).
Previously:
Stopping 8 nodes takes: 168s
Now:
Stopping 8 nodes takes: 40s
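The "send the requests first, then wait for completion together" approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `stop_all_then_wait`, `send_stop`, `is_done`, and `FakeCloud` are hypothetical names, and the timeout and poll interval defaults are arbitrary.

```python
import time
from typing import Callable, Dict, List


def stop_all_then_wait(
    node_ids: List[str],
    send_stop: Callable[[str], str],
    is_done: Callable[[str], bool],
    timeout: float = 300.0,
    poll_interval: float = 1.0,
) -> Dict[str, str]:
    """Issue every stop request first, then wait for all of them together.

    `send_stop` and `is_done` are hypothetical stand-ins for a cloud API;
    they are not SkyPilot or GCP functions.
    """
    # Phase 1: send all requests without blocking on any single node.
    pending = {node_id: send_stop(node_id) for node_id in node_ids}
    finished: Dict[str, str] = {}
    deadline = time.monotonic() + timeout
    # Phase 2: poll every outstanding operation in one loop.
    while pending:
        for node_id, operation in list(pending.items()):
            if is_done(operation):
                finished[node_id] = pending.pop(node_id)
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f'Operations still pending: {sorted(pending)}')
        time.sleep(poll_interval)
    return finished


class FakeCloud:
    """Hypothetical in-memory stub: each operation finishes after N polls."""

    def __init__(self, polls_needed: int = 2) -> None:
        self.polls_needed = polls_needed
        self.sent: List[str] = []
        self.polls: Dict[str, int] = {}

    def send_stop(self, node_id: str) -> str:
        operation = f'op-{node_id}'
        self.sent.append(node_id)
        self.polls[operation] = 0
        return operation

    def is_done(self, operation: str) -> bool:
        self.polls[operation] += 1
        return self.polls[operation] >= self.polls_needed


if __name__ == '__main__':
    cloud = FakeCloud()
    done = stop_all_then_wait(['node-0', 'node-1'], cloud.send_stop,
                              cloud.is_done, poll_interval=0.0)
    print(done)
```

With this shape, the total wait is bounded by the slowest node rather than by the sum over all nodes, which is consistent with the drop in stop time reported above.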
Tested (run the relevant ones):
bash format.sh
sky launch -c test-gcp-mn --num-nodes 8 --cpus 2+ --cloud gcp; sky stop -y test-gcp-mn; sky down -y test-gcp-mn
sky launch -c test-gcp-mn --num-nodes 8 --cpus 2+ --cloud gcp; manually stop some of the nodes; sky down -y test-gcp-mn
sky launch -c test-gcp-tpu --cpus 2+ --gpus tpu-v2-8; sky stop test-gcp-tpu-mn; sky down -y test-gcp-tpu-mn
sky launch -c test-tpu-vm test.yaml; sky stop test-tpu-vm; sky start test-tpu-vm; sky down test-tpu-vm
sky launch -c test-tpu-vm test.yaml; wait for the tpu vm to be preempted; sky status -r test-tpu-vm (correctly terminate the tpu vm on the cloud)
pytest tests/test_smoke.py (except for test_cancel_gcp, test_distributed_tf due to the issue of the tensorflow code for resnet training: [Examples] TF training code for ResNet no longer working #2207)
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh