[Autoscaler] gcp parallel terminate nodes #34455

Merged

Conversation

@Dan-Yeh Dan-Yeh commented Apr 17, 2023

Why are these changes needed?

ray down takes a long time when using GCPNodeProvider, as reported in #26239, because GCPNodeProvider inherits the serial implementation of terminate_nodes from the parent class NodeProvider and also takes a coarse-grained lock in its terminate_node, which prevents the calls from running concurrently (not entirely sure about this, since I'm new to the code).
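
For context, the inherited behaviour is roughly a serial loop like the following (a paraphrase for illustration, not the exact NodeProvider source):

class NodeProvider:  # simplified paraphrase, not the actual Ray source
    def terminate_nodes(self, node_ids):
        # Nodes are deleted one at a time; GCPNodeProvider.terminate_node
        # additionally holds a provider-wide lock, so calls never overlap.
        for node_id in node_ids:
            self.terminate_node(node_id)  # blocks on one GCP API request per node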

  • Add a ThreadPoolExecutor in GCPNodeProvider.terminate_nodes so that terminate_node calls run in parallel
  • Use fine-grained locking, assigning one RLock per node_id (a rough sketch of both changes follows below)
  • Add unit tests

Why not go with the suggestions (batch APIs and a non-blocking version of terminate_node) mentioned in #26239?
As a novice, I think both solutions would break the Liskov Substitution Principle, and anyone already calling terminate_node(s) would need to add await.
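
Roughly, the change looks like this (an illustrative sketch only; attribute names such as _node_locks and _locks_guard are placeholders, not the actual diff):

import threading
from concurrent.futures import ThreadPoolExecutor


class GCPNodeProvider:  # base class and unrelated methods omitted for brevity
    def __init__(self):
        # One RLock per node id instead of a single coarse provider-wide lock
        # (placeholder attribute names).
        self._node_locks = {}
        self._locks_guard = threading.Lock()

    def _lock_for(self, node_id):
        # Lazily create the per-node lock under a small guard lock.
        with self._locks_guard:
            return self._node_locks.setdefault(node_id, threading.RLock())

    def terminate_node(self, node_id):
        with self._lock_for(node_id):
            ...  # issue the GCP delete request for this node

    def terminate_nodes(self, node_ids):
        # Fan per-node terminations out across worker threads instead of
        # the serial loop inherited from NodeProvider.
        with ThreadPoolExecutor() as executor:
            list(executor.map(self.terminate_node, node_ids))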

Related issue number

#26239

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@wuisawesome wuisawesome left a comment

Thanks for the contribution @Dan-Yeh!

The overall aim of the PR (parallelizing node deletion) makes sense to me at a high level.

My main concern is around the locking.

I see how having terminate_node acquire a global lock would cause contention and effectively lead to serializing the requests.

With this new locking scheme, I'm concerned that we're introducing race conditions around the creation/deletion of nodes. Those operations seem like they inherently require a global lock.

I think a simple/safe way we could do this right now would be to keep the global lock and restructure the termination code like this:

import concurrent.futures


class NodeProvider:
    def _thread_unsafe_terminate_node(self, node_id):
        # Assumes the global lock is held for the duration of this operation.
        # The lock may be held by a different thread, as is the case in
        # `terminate_nodes()`.
        ...

    def terminate_node(self, node_id):
        with self.lock:
            self._thread_unsafe_terminate_node(node_id)

    def terminate_nodes(self, node_ids):
        with self.lock, concurrent.futures.ThreadPoolExecutor() as executor:
            executor.map(self._thread_unsafe_terminate_node, node_ids)

If you want to do the fine-grained locking, I think we would need two-phase locking. Something like this for node creation:

  1. Acquire global lock
  2. Add node id
  3. Acquire per-node locks
  4. Release global lock

and for termination:

  1. Acquire global lock
  2. Acquire per-node lock
  3. Delete the global entry
  4. Release the global lock
  5. Release the per-node lock
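
If it helps, here is a minimal sketch of what that two-phase scheme could look like (hypothetical names, only the locking skeleton, not a concrete proposal):

import threading


class NodeProvider:  # everything except the locking skeleton omitted
    def __init__(self):
        self.lock = threading.RLock()   # global lock
        self._node_locks = {}           # per-node locks (hypothetical attribute)

    def create_node_entry(self, node_id):
        # 1. global lock -> 2. add node id -> 3. per-node lock -> 4. release global.
        self.lock.acquire()
        node_lock = self._node_locks.setdefault(node_id, threading.RLock())
        node_lock.acquire()
        self.lock.release()
        try:
            ...  # provision the node while holding only its per-node lock
        finally:
            node_lock.release()

    def terminate_node(self, node_id):
        # 1. global lock -> 2. per-node lock -> 3. delete global entry
        # -> 4. release global -> 5. release per-node lock.
        self.lock.acquire()
        node_lock = self._node_locks[node_id]
        node_lock.acquire()
        del self._node_locks[node_id]
        self.lock.release()
        try:
            ...  # issue the delete request while holding only the per-node lock
        finally:
            node_lock.release()

The global lock is held only long enough to touch the shared dictionary, so the slow cloud calls can still overlap.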

Let me know if that makes sense to you.

@wuisawesome wuisawesome self-assigned this Apr 17, 2023

Dan-Yeh commented Apr 18, 2023

Thanks for the suggestions! I think I'll go with the simple/safe way, since it's hard to reason about all the possible race-condition and deadlock scenarios.

@wuisawesome wuisawesome left a comment

Thanks!

@wuisawesome

@Dan-Yeh do you mind rebasing? That should make the CI failures go away, and then I think we're good to merge.

Chen-Chen Yeh added 5 commits April 18, 2023 23:17
Signed-off-by: Chen-Chen Yeh <[email protected]>
…possibility of race condition and deadlock

Signed-off-by: Chen-Chen Yeh <[email protected]>
Signed-off-by: Chen-Chen Yeh <[email protected]>
@Dan-Yeh Dan-Yeh force-pushed the autoscaler/gcp_parallel_terminate_nodes branch from 049f22a to 51969a9 on April 18, 2023 21:20

Dan-Yeh commented Apr 18, 2023

Sure.
Just did it; hope it's alright, since it's my first time using rebase :)

@wuisawesome

Whelp, our CI has quite a few broken tests at the moment, but none of the test failures look related to this PR, so I'll merge now.

Thanks @Dan-Yeh for the contribution!

@wuisawesome wuisawesome merged commit 46fc663 into ray-project:master Apr 21, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023