[autoscaler] GCP TPU VM autoscaler #17278
Conversation
cc @hartikainen ;)
Overall looks pretty reasonable!
Exciting! Thanks for cc'ing me @richardliaw. I skimmed through the changes and they all look reasonable to me 🙂
TPU board support is feature complete and working. TPU pods are not yet supported (will be in a follow-up PR).
nice work! have you tested to make sure all the example gcp scripts still run properly?
@richardliaw Yes, all the previous cluster configs work fine (tested setup, automatic node creation, idle teardown and termination) and all of the existing compute functionality is unaffected.
This PR looks OK to me; @ijrsvt can you review/merge?
# We can't run autoscaling through a serviceAccount on TPUs (atm)
if _is_head_node_a_tpu(config):
    raise RuntimeError("TPUs are not supported as head nodes.")
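For context, a minimal sketch of what a check like this might look like (the key-based detection below is an assumption for illustration, not necessarily the PR's actual implementation):

```python
def _is_head_node_a_tpu(config: dict) -> bool:
    # Assumption: TPU node configs carry the TPU API's "acceleratorType"
    # field, while Compute Engine node configs use "machineType" instead.
    head_type = config["head_node_type"]
    node_config = config["available_node_types"][head_type]["node_config"]
    return "acceleratorType" in node_config
```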
Thanks for this check!
This LGTM from my perspective.
Test failures look unrelated (Windows & non-autoscaler), so going ahead with merging!
Squashed commits:

* Enforce per-node-type max workers
* type annonation
  Co-authored-by: Ameer Haj Ali <[email protected]>
* cleanup. comments. type annotations
* additional type annotation
  Co-authored-by: Ameer Haj Ali <[email protected]>
* additional cleanup. comments. type annotations
* _get_nodes_needed_for_request_resources to use FrozenSet
* comments
* whitespace
* [Placement Group] Fix resource index assignment between with bundle index and without bundle index pg (#17318)
* [serve] Add Ray API stability annotations (#17295)
* Support streaming output of runtime env setup to logger/driver (#17306)
* [SGD] v2 prototype: ``WorkerGroup`` implementation (#17330)
* wip
* formatting
* increase timeouts
* address comments
* comments
* fix
* address comments
* Update python/ray/util/sgd/v2/worker_group.py
  Co-authored-by: Richard Liaw <[email protected]>
* Update python/ray/util/sgd/v2/worker_group.py
  Co-authored-by: Richard Liaw <[email protected]>
* address comments
* formatting
* fix
* avoid race condition
  Co-authored-by: Richard Liaw <[email protected]>
* [RLlib] Discussion 3001: Fix comment on internal state shape (must be [B x S=state dim]). (#17341)
* [autoscaler] GCP TPU VM autoscaler (#17278)
* [Rllib] set self._allow_unknown_config (#17335)
  Co-authored-by: Sven Mika <[email protected]>
* [RLlib] Discussion 2294: Custom vector env example and fix. (#16083)
* [docs] Link broken in Tune's page (#17394) (#17407)
* [Serve] Fix response_model for class based view routes as well (#17376)
* [serve] Fix single deployment nightly test (#17368)
* [RLlib] SAC tuple observation space fix (#17356)
* Support schema on read for csv/json (#17354)
* [RLlib] New and changed version of parametric actions cartpole example + small suggested update in policy_client.py (#15664)
* [gcs] Fix GCS related issues: ByteSizeLong and redis connection (#17373)
* [runtime_env] Gracefully fail tasks when an environment fails to be set up (#17249)
* [docs] update docs with pip requirements (#17317)
* removed nodes_to_keep. cleanup
* formatting
* +comment
* treat max_workers=0 as 0 workers (as opposed to unlimited)
* fix wrong comment
* warning for inconsistent config
* terminate nodes with no matching node type right away
* quotes
* special handling for head node when enforcing max_workers per type. tests. cleanup
* cleanup comments and prints
* comments
* cleanup. removed special handling of head node.
* adding an eplicit non-None check in schedule_node_termination
* raise the exception

Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: DK.Pino <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
Co-authored-by: Antoni Baum <[email protected]>
Co-authored-by: Rohan138 <[email protected]>
Co-authored-by: amavilla <[email protected]>
Co-authored-by: Jiao <[email protected]>
Co-authored-by: Julius Frost <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: kk-55 <[email protected]>
Co-authored-by: Yi Cheng <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Why are these changes needed?
WIP!
This PR adds support for GCP TPU VMs in the autoscaler. This is achieved by refactoring and abstracting the current GCP autoscaler code so that it can handle different resource types with different underlying APIs (Compute Engine and TPU).
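As a rough sketch of what such an abstraction could look like (class and method names here are hypothetical; the PR may structure this differently), each node type is routed to the matching GCP API behind a shared interface:

```python
from abc import ABC, abstractmethod

from googleapiclient import discovery


class GCPResource(ABC):
    """Shared interface over the Compute Engine and TPU APIs (sketch)."""

    @abstractmethod
    def create_instance(self, name: str, node_config: dict) -> dict:
        ...

    @abstractmethod
    def list_instances(self) -> list:
        ...


class GCPCompute(GCPResource):
    def __init__(self, project: str, zone: str):
        self.api = discovery.build("compute", "v1")
        self.project, self.zone = project, zone

    def create_instance(self, name: str, node_config: dict) -> dict:
        # Compute Engine takes the instance name inside the request body.
        body = dict(node_config, name=name)
        return self.api.instances().insert(
            project=self.project, zone=self.zone, body=body).execute()

    def list_instances(self) -> list:
        resp = self.api.instances().list(
            project=self.project, zone=self.zone).execute()
        return resp.get("items", [])


class GCPTPU(GCPResource):
    def __init__(self, project: str, zone: str):
        # TPU VMs are managed through a separate API family.
        self.api = discovery.build("tpu", "v2alpha1")
        self.parent = f"projects/{project}/locations/{zone}"

    def create_instance(self, name: str, node_config: dict) -> dict:
        # The TPU API takes the node id as a separate parameter.
        return self.api.projects().locations().nodes().create(
            parent=self.parent, nodeId=name, body=node_config).execute()

    def list_instances(self) -> list:
        resp = self.api.projects().locations().nodes().list(
            parent=self.parent).execute()
        return resp.get("nodes", [])
```

The node provider can then pick the right implementation per node type, keeping the rest of the autoscaler code API-agnostic.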
This will allow users to take advantage of GCP TPUs without having to hack in support themselves, making projects such as https://github.com/kingoflolz/swarm-jax/tree/master/swarm_jax much simpler.
TPU pods are not yet supported; only single-board v2-8 and v3-8 instances are.
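For illustration, a hypothetical node_config for such a single-board TPU worker, shown as the dict the provider would pass to the TPU API (field names follow the TPU VM API; the specific values, especially the runtime version, are assumptions):

```python
# Hypothetical single-board TPU worker node_config (sketch).
tpu_node_config = {
    "acceleratorType": "v2-8",        # one TPU board; pods (e.g. v2-32) not yet supported
    "runtimeVersion": "v2-alpha",     # TPU VM runtime (assumed value)
    "schedulingConfig": {"preemptible": False},
}
```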
Requires further testing: not all functionality may work with TPUs yet. The existing (Compute) functionality should be unaffected, barring unintended regressions. The provided example config is very much WIP.
Related issue number
Checks
* I've run scripts/format.sh to lint the changes in this PR.