[Autoscaler] gcp parallel terminate nodes #34455
Conversation
Thanks for the contribution @Dan-Yeh!
I think the overall aim of the PR (parallelizing node deletion) makes sense to me at a high level. My main concern is around the locking.

I see how having `terminate_node` acquire a global lock would cause contention and effectively serialize the requests. With this new locking scheme, though, I'm concerned that we're introducing a race condition around the creation/deletion of nodes. Those operations seem like they inherently require a global lock.
I think a simple/safe way we could do this right now would be to keep the global lock and restructure the termination code like:

```python
import concurrent.futures

class NodeProvider:
    def _thread_unsafe_terminate_node(self, node_id):
        # Assumes the global lock is held for the duration of this
        # operation. The lock may be held by a different thread, as is
        # the case in `terminate_nodes()`.
        ...

    def terminate_node(self, node_id):
        with self.lock:
            self._thread_unsafe_terminate_node(node_id)

    def terminate_nodes(self, node_ids):
        with self.lock, concurrent.futures.ThreadPoolExecutor() as executor:
            executor.map(self._thread_unsafe_terminate_node, node_ids)
```
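To be concrete about why this stays safe: `terminate_nodes` holds the global lock for the whole batch, so a concurrent `terminate_node` call blocks until the batch finishes, while the deletions within the batch still run in parallel. A tiny self-contained demo of that shape (simplified; the slow cloud API call is stubbed with a sleep, and `DemoProvider` is an illustrative name, not the real class):

```python
import concurrent.futures
import threading
import time

class DemoProvider:
    """Simplified stand-in for the NodeProvider sketch above."""

    def __init__(self):
        self.lock = threading.RLock()
        self.deleted = []

    def _thread_unsafe_terminate_node(self, node_id):
        time.sleep(0.05)  # pretend this is a slow cloud API call
        self.deleted.append(node_id)

    def terminate_nodes(self, node_ids):
        # Global lock held for the whole batch; deletions run in parallel
        # on the pool's worker threads.
        with self.lock, concurrent.futures.ThreadPoolExecutor() as executor:
            list(executor.map(self._thread_unsafe_terminate_node, node_ids))

provider = DemoProvider()
start = time.monotonic()
provider.terminate_nodes([f"node-{i}" for i in range(8)])
elapsed = time.monotonic() - start
# With the default worker count this finishes well under 8 * 0.05s serial time.
```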
If you want to do the fine-grained locking, I think we would need two-phase locking. Something like, for creation:

- Acquire global lock
- Add node id
- Acquire per-node lock
- Release global lock

and for termination:

- Acquire global lock
- Acquire per-node lock
- Delete global entry
- Release global lock
- Release per-node lock
Let me know if that makes sense to you
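The two-phase ordering above can be sketched roughly as follows. This is illustrative only (the class and method names are assumptions, not the actual Ray `NodeProvider` API, and the provision/delete bodies are stubbed); the key point is that the per-node lock is always acquired while the global lock is still held:

```python
import threading

class FineGrainedNodeProvider:
    """Hypothetical sketch of the two-phase locking scheme."""

    def __init__(self):
        self.global_lock = threading.Lock()
        self.node_locks = {}  # node_id -> per-node RLock

    def create_node(self, node_id):
        with self.global_lock:
            node_lock = threading.RLock()
            self.node_locks[node_id] = node_lock
            # Take the per-node lock BEFORE dropping the global lock, so
            # no other thread can act on this node in between.
            node_lock.acquire()
        try:
            pass  # ... provision the instance here ...
        finally:
            node_lock.release()

    def terminate_node(self, node_id):
        with self.global_lock:
            node_lock = self.node_locks[node_id]
            node_lock.acquire()
            # Delete the global entry while both locks are held.
            del self.node_locks[node_id]
        try:
            pass  # ... issue the cloud delete call here ...
        finally:
            node_lock.release()
```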
Thanks for the suggestions! I think I would go with the simple/safe way, because it is hard to reason about all the possible race-condition and deadlock scenarios.
Thanks!
@Dan-Yeh do you mind rebasing? That should make the CI failures go away, and then I think we're good to merge.
Sure.
Welp, our CI has quite a few broken tests at the moment, but none of the test failures look related to this PR, so I'll merge now. Thanks @Dan-Yeh for the contribution!
Why are these changes needed?

`ray down` takes a lot of time when using GCPNodeProvider, as stated in #26239, because GCPNodeProvider uses the serial implementation of `terminate_nodes` from the parent class NodeProvider and also uses a coarse lock in its `terminate_node`, which prevents executing it concurrently (not really sure, since I'm new to this). This PR:

- adds a ThreadPoolExecutor in `GCPNodeProvider.terminate_nodes` for parallel execution of `terminate_node`
- uses fine-grained locks that assign one RLock per node_id
- adds unit tests

Why not go with the suggestions (batch APIs and a non-blocking version of `terminate_node`) mentioned in #26239? As a novice, I think both solutions would break the Liskov Substitution Principle, and those who already use `terminate_node(s)` would need to add `await`.
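As a rough illustration of this approach (not the actual GCPNodeProvider code; the class name, lock-table structure, and stubbed API call are assumptions), the per-node-lock plus thread-pool pattern looks something like:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ParallelTerminateProvider:
    """Illustrative sketch: one RLock per node_id, parallel termination."""

    def __init__(self):
        self._lock_table_guard = threading.Lock()
        self._node_locks = {}
        self.terminated = []

    def _get_node_lock(self, node_id):
        # Short critical section: only to look up / create the per-node lock.
        with self._lock_table_guard:
            return self._node_locks.setdefault(node_id, threading.RLock())

    def terminate_node(self, node_id):
        # Hold only this node's lock, so other nodes terminate concurrently.
        with self._get_node_lock(node_id):
            # ... real code would call the GCP delete API here ...
            self.terminated.append(node_id)

    def terminate_nodes(self, node_ids):
        # Fan the termination requests out across worker threads.
        with ThreadPoolExecutor() as executor:
            list(executor.map(self.terminate_node, node_ids))
```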
Related issue number
#26239
Checks

- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.