[autoscaler] GCP TPU VM autoscaler #17278

Merged (28 commits) on Jul 28, 2021

@Yard1 (Member) commented Jul 22, 2021

Why are these changes needed?

WIP!

This PR adds support for GCP TPU VMs to the autoscaler. It does so by refactoring the existing GCP autoscaler code into an abstraction over different resource types, each backed by its own API (Compute and TPU).
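A minimal sketch of the kind of abstraction described above, not the PR's actual code. All class and function names here (`ResourceBackend`, `ComputeBackend`, `TpuBackend`, `kind_of`, `Provider`) are hypothetical, the backends store nodes in memory instead of calling the GCP APIs, and the rule that a TPU node config is recognized by an `acceleratorType` key is an assumption.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import Dict, List


class NodeKind(Enum):
    COMPUTE = "compute"
    TPU = "tpu"


class ResourceBackend(ABC):
    """Common interface; a real subclass would wrap one GCP API client."""

    def __init__(self) -> None:
        # In-memory stand-in for the remote API (illustration only).
        self._nodes: Dict[str, dict] = {}

    @abstractmethod
    def kind(self) -> NodeKind:
        """Return which kind of resource this backend manages."""

    def create(self, name: str, node_config: dict) -> str:
        self._nodes[name] = dict(node_config)
        return name

    def list_names(self) -> List[str]:
        return sorted(self._nodes)

    def delete(self, name: str) -> None:
        self._nodes.pop(name, None)


class ComputeBackend(ResourceBackend):
    def kind(self) -> NodeKind:
        return NodeKind.COMPUTE


class TpuBackend(ResourceBackend):
    def kind(self) -> NodeKind:
        return NodeKind.TPU


def kind_of(node_config: dict) -> NodeKind:
    # Assumption: TPU configs carry "acceleratorType"; everything else is Compute.
    if "acceleratorType" in node_config:
        return NodeKind.TPU
    return NodeKind.COMPUTE


class Provider:
    """Routes each request to the backend that matches the node config."""

    def __init__(self) -> None:
        self.backends: Dict[NodeKind, ResourceBackend] = {
            NodeKind.COMPUTE: ComputeBackend(),
            NodeKind.TPU: TpuBackend(),
        }

    def create_node(self, name: str, node_config: dict) -> str:
        return self.backends[kind_of(node_config)].create(name, node_config)

    def non_terminated_nodes(self) -> List[str]:
        names: List[str] = []
        for backend in self.backends.values():
            names.extend(backend.list_names())
        return sorted(names)
```

The provider code stays the same regardless of which API serves a node; only `kind_of` and the backend table need to know that two resource types exist.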

This lets users take advantage of GCP TPUs without hacking in support themselves, which makes projects such as https://github.com/kingoflolz/swarm-jax/tree/master/swarm_jax much simpler.

TPU pods are not yet supported (only v2-8 and v3-8 instances are supported).
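The restriction above could be enforced with an early check like the following. This is a hedged sketch and not the validation this PR performs; `SUPPORTED_ACCELERATOR_TYPES` and `check_accelerator_type` are hypothetical names introduced for illustration.

```python
# Only single-board TPU types are accepted in this sketch; pod slices are rejected.
SUPPORTED_ACCELERATOR_TYPES = frozenset({"v2-8", "v3-8"})


def check_accelerator_type(accelerator_type: str) -> None:
    """Raise ValueError for any accelerator type other than v2-8 or v3-8."""
    if accelerator_type not in SUPPORTED_ACCELERATOR_TYPES:
        supported = ", ".join(sorted(SUPPORTED_ACCELERATOR_TYPES))
        raise ValueError(
            f"Unsupported TPU accelerator type {accelerator_type!r}; "
            f"supported types: {supported}. TPU pods are not yet supported."
        )
```

Failing at config-validation time gives a clearer error than letting an unsupported request reach the TPU API.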

Requires further testing. Some functionality may not work with TPUs. The existing (compute) functionality should be unaffected, barring mistakes. The provided config is very WIP.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Yard1 Yard1 added the do-not-merge Do not merge this PR! label Jul 22, 2021
@Yard1 Yard1 requested review from richardliaw and ijrsvt July 22, 2021 23:38
@richardliaw (Contributor)

cc @hartikainen ;)

@ijrsvt (Contributor) left a comment

Overall looks pretty reasonable!

python/ray/autoscaler/_private/gcp/config.py (outdated, resolved)
python/ray/autoscaler/_private/gcp/config.py (resolved)
python/ray/autoscaler/_private/gcp/node_provider.py (outdated, resolved)
@Yard1 Yard1 requested a review from ijrsvt July 23, 2021 19:18
@hartikainen (Contributor)

Exciting! Thanks for cc'ing me @richardliaw. I skimmed through the changes and they all look reasonable to me 🙂

@Yard1 Yard1 marked this pull request as ready for review July 27, 2021 15:49
@Yard1 Yard1 changed the title [WIP][autoscaler] GCP TPU VM autoscaler [autoscaler] GCP TPU VM autoscaler Jul 27, 2021
@Yard1 Yard1 removed the do-not-merge Do not merge this PR! label Jul 27, 2021
@Yard1 (Member, Author) commented Jul 27, 2021

TPU board support is feature complete and working. TPU pods are not yet supported (will be in a follow-up PR).

@richardliaw (Contributor) left a comment

nice work! have you tested to make sure all the example gcp scripts still run properly?

@Yard1 (Member, Author) commented Jul 27, 2021

@richardliaw Yes, all the previous cluster configs work fine (tested setup, automatic node creation, idle teardown and termination) and all of the existing compute functionality is unaffected.

@richardliaw (Contributor)

This PR looks OK to me; @ijrsvt can you review/merge?

Comment on lines +264 to +266
# We can't run autoscaling through a serviceAccount on TPUs (atm)
if _is_head_node_a_tpu(config):
    raise RuntimeError("TPUs are not supported as head nodes.")
Contributor

Thanks for this check!

@ijrsvt (Contributor) left a comment

This LGTM from my perspective.

@ijrsvt (Contributor) commented Jul 28, 2021

Test failures look unrelated (Windows & non-autoscaler), so going ahead with merging!

@ijrsvt ijrsvt merged commit 1f35470 into ray-project:master Jul 28, 2021
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Jul 31, 2021
DmitriGekhtman pushed a commit that referenced this pull request Aug 3, 2021
* Enforce per-node-type max workers

* type annotation

Co-authored-by: Ameer Haj Ali <[email protected]>

* cleanup. comments. type annotations

* additional type annotation

Co-authored-by: Ameer Haj Ali <[email protected]>

* additional cleanup. comments. type annotations

* _get_nodes_needed_for_request_resources to use FrozenSet

* comments

* whitespace

* [Placement Group] Fix resource index assignment between with bundle index and without bundle index pg (#17318)

* [serve] Add Ray API stability annotations (#17295)

* Support streaming output of runtime env setup to logger/driver (#17306)

* [SGD] v2 prototype: ``WorkerGroup`` implementation (#17330)

* wip

* formatting

* increase timeouts

* address comments

* comments

* fix

* address comments

* Update python/ray/util/sgd/v2/worker_group.py

Co-authored-by: Richard Liaw <[email protected]>

* Update python/ray/util/sgd/v2/worker_group.py

Co-authored-by: Richard Liaw <[email protected]>

* address comments

* formatting

* fix

* avoid race condition

Co-authored-by: Richard Liaw <[email protected]>

* [RLlib] Discussion 3001: Fix comment on internal state shape (must be [B x S=state dim]). (#17341)

* [autoscaler] GCP TPU VM autoscaler (#17278)

* [Rllib] set self._allow_unknown_config (#17335)

Co-authored-by: Sven Mika <[email protected]>

* [RLlib] Discussion 2294: Custom vector env example and fix. (#16083)

* [docs] Link broken in Tune's page (#17394) (#17407)

* [Serve] Fix response_model for class based view routes as well (#17376)

* [serve] Fix single deployment nightly test (#17368)

* [RLlib] SAC tuple observation space fix (#17356)

* Support schema on read for csv/json (#17354)

* [RLlib] New and changed version of parametric actions cartpole example + small suggested update in policy_client.py (#15664)

* [gcs] Fix GCS related issues: ByteSizeLong and redis connection (#17373)

* [runtime_env] Gracefully fail tasks when an environment fails to be set up (#17249)

* [docs] update docs with pip requirements (#17317)

* removed nodes_to_keep. cleanup

* formatting

* +comment

* treat max_workers=0 as 0 workers (as opposed to unlimited)

* fix wrong comment

* warning for inconsistent config

* terminate nodes with no matching node type right away

* quotes

* special handling for head node when enforcing max_workers per type. tests. cleanup

* cleanup comments and prints

* comments

* cleanup. removed special handling of head node.

* adding an explicit non-None check in schedule_node_termination

* raise the exception

Co-authored-by: Ameer Haj Ali <[email protected]>

Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: DK.Pino <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
Co-authored-by: Antoni Baum <[email protected]>
Co-authored-by: Rohan138 <[email protected]>
Co-authored-by: amavilla <[email protected]>
Co-authored-by: Jiao <[email protected]>
Co-authored-by: Julius Frost <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: kk-55 <[email protected]>
Co-authored-by: Yi Cheng <[email protected]>
Co-authored-by: matthewdeng <[email protected]>

4 participants