-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Better propogate node death information #45128
Conversation
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
This reverts commit dd3d285. Signed-off-by: Rui Qiao <[email protected]>
This reverts commit 7acc550. Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add some tests
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
|
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Why are these changes needed?
This is the first PR of a series to better propagate and expose node death information.
Background: in Ray, a node can be either ALIVE or DEAD, and the death reason of a node could be unexpected (e.g., crash) or expected (e.g., idle termination or spot preemption). Currently, GCS knows the actual death reason, but this information is not shown to the users. As a result, users might think a node crashes when they see it is DEAD, while in reality these could be expected scenarios such as spot preemption. This may give users the wrong impression that Ray is unstable. In the series of changes, we are going to better propagate and expose the node death info.
This PR does the following:
reason_message
toNodeDeathInfo
so that we track more verbose node death infoUnregisterNode
PRC in GCS and use it instead ofDrainNode
RPC wherever appropriateThis PR is the basis for future changes where we:
See more details in design doc:
https://docs.google.com/document/d/1tn6Uj-SoEAaBu5_HWl4dNo3JBzTd_w09RPsIWQpUU_s/edit
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.