[Autoscaler] gcp parallel terminate nodes #34455

Merged

Conversation

@Dan-Yeh Dan-Yeh commented Apr 17, 2023

Why are these changes needed?

ray down takes a long time when using GCPNodeProvider, as reported in #26239, because GCPNodeProvider inherits the serial implementation of terminate_nodes from the parent class NodeProvider and also takes a coarse-grained lock in its terminate_node, which prevents the calls from running concurrently (not entirely sure about this, since I'm new to the code).
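
For context, the inherited behaviour is roughly a serial loop like the following (a paraphrase for illustration, not the exact NodeProvider source):

class NodeProvider:  # simplified paraphrase, not the actual Ray source
    def terminate_nodes(self, node_ids):
        # Nodes are deleted one at a time; GCPNodeProvider.terminate_node
        # additionally holds a provider-wide lock, so calls never overlap.
        for node_id in node_ids:
            self.terminate_node(node_id)  # blocks on one GCP API request per node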

  • Add a ThreadPoolExecutor in GCPNodeProvider.terminate_nodes so that terminate_node calls run in parallel
  • Use fine-grained locking, assigning one RLock per node_id (a rough sketch of both changes follows below)
  • Add unit tests

Why not go with the suggestions (batch APIs and a non-blocking version of terminate_node) mentioned in #26239?
As a novice, I think both solutions would break the Liskov Substitution Principle, and anyone already calling terminate_node(s) would need to add await.
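
Roughly, the change looks like this (an illustrative sketch only; attribute names such as _node_locks and _locks_guard are placeholders, not the actual diff):

import threading
from concurrent.futures import ThreadPoolExecutor


class GCPNodeProvider:  # base class and unrelated methods omitted for brevity
    def __init__(self):
        # One RLock per node id instead of a single coarse provider-wide lock
        # (placeholder attribute names).
        self._node_locks = {}
        self._locks_guard = threading.Lock()

    def _lock_for(self, node_id):
        # Lazily create the per-node lock under a small guard lock.
        with self._locks_guard:
            return self._node_locks.setdefault(node_id, threading.RLock())

    def terminate_node(self, node_id):
        with self._lock_for(node_id):
            ...  # issue the GCP delete request for this node

    def terminate_nodes(self, node_ids):
        # Fan per-node terminations out across worker threads instead of
        # the serial loop inherited from NodeProvider.
        with ThreadPoolExecutor() as executor:
            list(executor.map(self.terminate_node, node_ids))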

Related issue number

#26239

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@wuisawesome wuisawesome left a comment

Thanks for the contribution @Dan-Yeh!

The overall aim of the PR (parallelizing node deletion) makes sense to me at a high level.

My main concern is around the locking.

I see how having terminate_node acquire a global lock would cause contention and effectively lead to serializing the requests.

With this new locking scheme, I'm concerned that we're introducing race conditions around the creation/deletion of nodes. Those operations seem like they inherently require a global lock.

I think a simple/safe way we could do this right now would be to keep the global lock and restructure the termination code like this:

import concurrent.futures


class NodeProvider:
    def _thread_unsafe_terminate_node(self, node_id):
        # Assumes the global lock is held for the duration of this operation.
        # The lock may be held by a different thread, as is the case in
        # `terminate_nodes()`.
        ...

    def terminate_node(self, node_id):
        with self.lock:
            self._thread_unsafe_terminate_node(node_id)

    def terminate_nodes(self, node_ids):
        with self.lock, concurrent.futures.ThreadPoolExecutor() as executor:
            executor.map(self._thread_unsafe_terminate_node, node_ids)

If you want to do the fine-grained locking, I think we would need two-phase locking. Something like this for node creation:

  1. Acquire global lock
  2. Add node id
  3. Acquire per-node locks
  4. Release global lock

and for termination:

  1. Acquire global lock
  2. Acquire per-node lock
  3. Delete the global entry
  4. Release the global lock
  5. Release the per-node lock
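
If it helps, here is a minimal sketch of what that two-phase scheme could look like (hypothetical names, only the locking skeleton, not a concrete proposal):

import threading


class NodeProvider:  # everything except the locking skeleton omitted
    def __init__(self):
        self.lock = threading.RLock()   # global lock
        self._node_locks = {}           # per-node locks (hypothetical attribute)

    def create_node_entry(self, node_id):
        # 1. global lock -> 2. add node id -> 3. per-node lock -> 4. release global.
        self.lock.acquire()
        node_lock = self._node_locks.setdefault(node_id, threading.RLock())
        node_lock.acquire()
        self.lock.release()
        try:
            ...  # provision the node while holding only its per-node lock
        finally:
            node_lock.release()

    def terminate_node(self, node_id):
        # 1. global lock -> 2. per-node lock -> 3. delete global entry
        # -> 4. release global -> 5. release per-node lock.
        self.lock.acquire()
        node_lock = self._node_locks[node_id]
        node_lock.acquire()
        del self._node_locks[node_id]
        self.lock.release()
        try:
            ...  # issue the delete request while holding only the per-node lock
        finally:
            node_lock.release()

The global lock is held only long enough to touch the shared dictionary, so the slow cloud calls can still overlap.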

Let me know if that makes sense to you.

@wuisawesome wuisawesome self-assigned this Apr 17, 2023

Dan-Yeh commented Apr 18, 2023

Thanks for the suggestions! I think I'll go with the simple/safe way, since it's hard to reason about all the possible race-condition and deadlock scenarios.

@wuisawesome wuisawesome left a comment

Thanks!

@wuisawesome

@Dan-Yeh do you mind rebasing? That should make the CI failures go away, and then I think we're good to merge.

Chen-Chen Yeh added 5 commits April 18, 2023 23:17
Signed-off-by: Chen-Chen Yeh <[email protected]>
…possibility of race condition and deadlock

Signed-off-by: Chen-Chen Yeh <[email protected]>
Signed-off-by: Chen-Chen Yeh <[email protected]>
@Dan-Yeh Dan-Yeh force-pushed the autoscaler/gcp_parallel_terminate_nodes branch from 049f22a to 51969a9 on April 18, 2023 21:20

Dan-Yeh commented Apr 18, 2023

Sure.
Just did it; hope it's alright, since it's my first time using rebase :)

@wuisawesome

Whelp, our CI has quite a few broken tests at the moment, but none of the test failures look related to this PR, so I'll merge now.

Thanks @Dan-Yeh for the contribution!

@wuisawesome wuisawesome merged commit 46fc663 into ray-project:master Apr 21, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023