[GCP] Adopt new provisioner to stop/down clusters #2199

Merged (18 commits) on Jul 15, 2023

@Michaelvll Michaelvll commented Jul 9, 2023

This PR adopts the new provisioner refactoring from #2121.

Fixes #2089: nodes are no longer leaked when a cluster is only partially down.
Also fixes #1983: we now send all the requests first and then wait for them to complete together (using multiple threads as in ray-project/ray#34455 is buggy; refer to #2160 (comment)).

Previously:
Stopping 8 nodes takes: 168s

Now:
Stopping 8 nodes takes: 40s
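
The speedup is consistent with the "send the requests first, then wait for them together" pattern described above. The following is a minimal single-threaded sketch of that pattern, not the code in this PR: `FakeCloud`, `FakeOperation`, `wait_all`, `stop_sequential` and `stop_batched` are hypothetical names, and the fake client only simulates a long-running cloud operation with a fixed latency.

```python
import time
from dataclasses import dataclass


@dataclass
class FakeOperation:
    """Hypothetical stand-in for a cloud long-running operation handle."""
    node: str
    done_at: float

    def done(self) -> bool:
        return time.monotonic() >= self.done_at


class FakeCloud:
    """Hypothetical client: stop() returns at once with an operation handle."""

    def __init__(self, latency: float = 0.05):
        self.latency = latency

    def stop(self, node: str) -> FakeOperation:
        return FakeOperation(node, time.monotonic() + self.latency)


def wait_all(operations, poll_interval: float = 0.01, timeout: float = 5.0):
    """Poll every pending operation in one loop until all have finished."""
    deadline = time.monotonic() + timeout
    pending = list(operations)
    while pending:
        pending = [op for op in pending if not op.done()]
        if not pending:
            break
        if time.monotonic() > deadline:
            stuck = [op.node for op in pending]
            raise TimeoutError(f"operations still pending: {stuck}")
        time.sleep(poll_interval)


def stop_sequential(cloud: FakeCloud, nodes):
    """Wait for each node before sending the next request."""
    for node in nodes:
        wait_all([cloud.stop(node)])
    return list(nodes)


def stop_batched(cloud: FakeCloud, nodes):
    """Send every request first, then wait for all of them together."""
    operations = [cloud.stop(node) for node in nodes]
    wait_all(operations)
    return [op.node for op in operations]
```

With the simulated latency, `stop_sequential` takes roughly the number of nodes times the latency, while `stop_batched` takes roughly one latency, and no threads are needed because the waiting happens in a single polling loop.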

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test-gcp-mn --num-nodes 8 --cpus 2+ --cloud gcp; sky stop -y test-gcp-mn; sky down -y test-gcp-mn
    • sky launch -c test-gcp-mn --num-nodes 8 --cpus 2+ --cloud gcp; manually stop some of the nodes; sky down -y test-gcp-mn
    • sky launch -c test-gcp-tpu --cpus 2+ --gpus tpu-v2-8; sky stop test-gcp-tpu; sky down -y test-gcp-tpu
    • sky launch -c test-tpu-vm test.yaml; sky stop test-tpu-vm; sky start test-tpu-vm; sky down test-tpu-vm
      resources:
          accelerators: tpu-v2-8
          accelerator_args:
              tpu_vm: true
      
    • sky launch -c test-tpu-vm test.yaml; wait for the TPU VM to be preempted; sky status -r test-tpu-vm correctly terminates the TPU VM on the cloud.
  • All smoke tests: pytest tests/test_smoke.py (except for test_cancel_gcp and test_distributed_tf, due to an issue with the TensorFlow code for ResNet training: "[Examples] TF training code for ResNet no longer working", #2207)
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll Michaelvll requested a review from suquark July 9, 2023 23:02
@Michaelvll Michaelvll marked this pull request as ready for review July 9, 2023 23:07
@@ -71,7 +71,7 @@
 _NODES_LAUNCHING_PROGRESS_TIMEOUT = {
     clouds.AWS: 90,
     clouds.Azure: 90,
-    clouds.GCP: 120,
+    clouds.GCP: 240,
Collaborator

Why do we need to change this?

Collaborator Author

This is a known issue: 120 seconds is not enough to launch multiple nodes on GCP. We need to increase the timeout so that sky launch --cloud gcp --num-nodes 8 works.

@Michaelvll Michaelvll requested a review from suquark July 11, 2023 22:44
Collaborator

@suquark suquark left a comment

LGTM

Michaelvll commented Jul 15, 2023

Thanks for reviewing this @suquark! I am now testing it with:
