[autoscaler] GCP TPU VM autoscaler #17278
Conversation
cc @hartikainen ;)
Overall looks pretty reasonable!
Exciting! Thanks for cc'ing me @richardliaw. I skimmed through the changes and they all look reasonable to me 🙂
TPU board support is feature complete and working. TPU pods are not yet supported (will be in a follow-up PR).
nice work! have you tested to make sure all the example gcp scripts still run properly?
@richardliaw Yes, all the previous cluster configs work fine (tested setup, automatic node creation, idle teardown and termination) and all of the existing compute functionality is unaffected.
This PR looks OK to me; @ijrsvt can you review/merge?
# We can't run autoscaling through a serviceAccount on TPUs (atm)
if _is_head_node_a_tpu(config):
    raise RuntimeError("TPUs are not supported as head nodes.")
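For context, a minimal sketch of what a check like this might look like (the key-based detection below is an assumption for illustration, not necessarily the PR's actual implementation):

```python
def _is_head_node_a_tpu(config: dict) -> bool:
    # Assumption: TPU node configs carry the TPU API's "acceleratorType"
    # field, while Compute Engine node configs use "machineType" instead.
    head_type = config["head_node_type"]
    node_config = config["available_node_types"][head_type]["node_config"]
    return "acceleratorType" in node_config
```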
Thanks for this check!
This LGTM from my perspective.
Test failures look unrelated (Windows & non-autoscaler), so going ahead with merging!
Squashed commits:

* Enforce per-node-type max workers
* type annonation
  Co-authored-by: Ameer Haj Ali <[email protected]>
* cleanup. comments. type annotations
* additional type annotation
  Co-authored-by: Ameer Haj Ali <[email protected]>
* additional cleanup. comments. type annotations
* _get_nodes_needed_for_request_resources to use FrozenSet
* comments
* whitespace
* [Placement Group] Fix resource index assignment between with bundle index and without bundle index pg (#17318)
* [serve] Add Ray API stability annotations (#17295)
* Support streaming output of runtime env setup to logger/driver (#17306)
* [SGD] v2 prototype: ``WorkerGroup`` implementation (#17330)
* wip
* formatting
* increase timeouts
* address comments
* comments
* fix
* address comments
* Update python/ray/util/sgd/v2/worker_group.py
  Co-authored-by: Richard Liaw <[email protected]>
* Update python/ray/util/sgd/v2/worker_group.py
  Co-authored-by: Richard Liaw <[email protected]>
* address comments
* formatting
* fix
* avoid race condition
  Co-authored-by: Richard Liaw <[email protected]>
* [RLlib] Discussion 3001: Fix comment on internal state shape (must be [B x S=state dim]). (#17341)
* [autoscaler] GCP TPU VM autoscaler (#17278)
* [Rllib] set self._allow_unknown_config (#17335)
  Co-authored-by: Sven Mika <[email protected]>
* [RLlib] Discussion 2294: Custom vector env example and fix. (#16083)
* [docs] Link broken in Tune's page (#17394) (#17407)
* [Serve] Fix response_model for class based view routes as well (#17376)
* [serve] Fix single deployment nightly test (#17368)
* [RLlib] SAC tuple observation space fix (#17356)
* Support schema on read for csv/json (#17354)
* [RLlib] New and changed version of parametric actions cartpole example + small suggested update in policy_client.py (#15664)
* [gcs] Fix GCS related issues: ByteSizeLong and redis connection (#17373)
* [runtime_env] Gracefully fail tasks when an environment fails to be set up (#17249)
* [docs] update docs with pip requirements (#17317)
* removed nodes_to_keep. cleanup
* formatting
* +comment
* treat max_workers=0 as 0 workers (as opposed to unlimited)
* fix wrong comment
* warning for inconsistent config
* terminate nodes with no matching node type right away
* quotes
* special handling for head node when enforcing max_workers per type. tests. cleanup
* cleanup comments and prints
* comments
* cleanup. removed special handling of head node.
* adding an eplicit non-None check in schedule_node_termination
* raise the exception

Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: DK.Pino <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
Co-authored-by: Antoni Baum <[email protected]>
Co-authored-by: Rohan138 <[email protected]>
Co-authored-by: amavilla <[email protected]>
Co-authored-by: Jiao <[email protected]>
Co-authored-by: Julius Frost <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: kk-55 <[email protected]>
Co-authored-by: Yi Cheng <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Why are these changes needed?
WIP!
This PR adds support for GCP TPU VMs in the autoscaler. This is achieved by refactoring and abstracting the current GCP autoscaler code so that it can handle different resource types with different underlying APIs (Compute Engine and TPU).
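As a rough sketch of what such an abstraction could look like (class and method names here are hypothetical; the PR may structure this differently), each node type is routed to the matching GCP API behind a shared interface:

```python
from abc import ABC, abstractmethod

from googleapiclient import discovery


class GCPResource(ABC):
    """Shared interface over the Compute Engine and TPU APIs (sketch)."""

    @abstractmethod
    def create_instance(self, name: str, node_config: dict) -> dict:
        ...

    @abstractmethod
    def list_instances(self) -> list:
        ...


class GCPCompute(GCPResource):
    def __init__(self, project: str, zone: str):
        self.api = discovery.build("compute", "v1")
        self.project, self.zone = project, zone

    def create_instance(self, name: str, node_config: dict) -> dict:
        # Compute Engine takes the instance name inside the request body.
        body = dict(node_config, name=name)
        return self.api.instances().insert(
            project=self.project, zone=self.zone, body=body).execute()

    def list_instances(self) -> list:
        resp = self.api.instances().list(
            project=self.project, zone=self.zone).execute()
        return resp.get("items", [])


class GCPTPU(GCPResource):
    def __init__(self, project: str, zone: str):
        # TPU VMs are managed through a separate API family.
        self.api = discovery.build("tpu", "v2alpha1")
        self.parent = f"projects/{project}/locations/{zone}"

    def create_instance(self, name: str, node_config: dict) -> dict:
        # The TPU API takes the node id as a separate parameter.
        return self.api.projects().locations().nodes().create(
            parent=self.parent, nodeId=name, body=node_config).execute()

    def list_instances(self) -> list:
        resp = self.api.projects().locations().nodes().list(
            parent=self.parent).execute()
        return resp.get("nodes", [])
```

The node provider can then pick the right implementation per node type, keeping the rest of the autoscaler code API-agnostic.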
This will allow users to take advantage of GCP TPUs without having to hack in support themselves, making projects such as https://github.com/kingoflolz/swarm-jax/tree/master/swarm_jax much simpler.
TPU pods are not yet supported; only single-board v2-8 and v3-8 instances are.
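For illustration, a hypothetical node_config for such a single-board TPU worker, shown as the dict the provider would pass to the TPU API (field names follow the TPU VM API; the specific values, especially the runtime version, are assumptions):

```python
# Hypothetical single-board TPU worker node_config (sketch).
tpu_node_config = {
    "acceleratorType": "v2-8",        # one TPU board; pods (e.g. v2-32) not yet supported
    "runtimeVersion": "v2-alpha",     # TPU VM runtime (assumed value)
    "schedulingConfig": {"preemptible": False},
}
```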
Requires further testing: not all functionality may work with TPUs yet. The existing (Compute) functionality should be unaffected, barring unintended regressions. The provided example config is very much WIP.
Related issue number
Checks
* I've run scripts/format.sh to lint the changes in this PR.