-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Add more accurate worker exit #24468
[Core] Add more accurate worker exit #24468
Conversation
Add 30 seconds grace period before raising an exception from this test failure (https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_1FL4g3cMg1wYifWf52tAaWtJ?command-history-section=command_history). What I'd like to see is some sort of error messages are propagated to the driver if this is due to some unexpected issues. Note that this PR also adds more detailed exit information to all worker failures, but this is still WIP #24468
Found many |
@WangTaoTheTonic not yet. I am doing the final refinement & adding tests now |
@@ -630,19 +630,19 @@ message MetricPoint { | |||
// Type of a worker exit. | |||
enum WorkerExitType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can simplify it to:
SYSTEM_EXIT,
USER_EXIT,
SYSTEM_ERROR,
USER_ERROR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm in this case where does INTENTIONAL_SYSTEM_EXIT
belong to?
UNEXPECTED_SYSTEM_EXIT -> SYSTEM_EXIT
INTENTIONAL_USER_EXIT -> USER_EXIT
ACTOR_INIT_FAILURE_EXIT -> USER_ERROR
INTENTIONAL_SYSTEM_EXIT -> ? It doesn't seem to be SYSTEM_ERROR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EXIT == Intentional
ERROR == Unexpected
cc @jjyao when you read "Exit vs Error", did you feel it was clear?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I did
INTENDED_SYSTEM_EXIT,
INTENDED_USER_EXIT,
SYSTEM_ERROR,
USER_ERROR
I think if I add INTENDED, it is clearer to understand the enum (since system exit usually includes unexpected exit iiuc)
// Type of this worker. | ||
WorkerType worker_type = 5; | ||
// This is for AddWorker. | ||
map<string, bytes> worker_info = 6; | ||
// The exception thrown in creation task. This field is set if this worker died because | ||
// of exception thrown in actor's creation task. Only applies when is_alive=false. | ||
RayException creation_task_exception = 18; | ||
// Whether it's an intentional disconnect, only applies then `is_alive` is false. | ||
optional WorkerExitType exit_type = 19; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm this might cause backward compatibility issue. can you use a different name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm should we care this now? it requires huge refactoring, and I feel like backward compatibility problems will occur from other parts from these versions since we don't have proper design there yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talked offline. This is okay because it is the message that doesn't care backward compatibility for autoscaler Also cc @wuisawesome
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep i don't believe the autoscaler uses this message
src/ray/core_worker/core_worker.cc
Outdated
exit_type, | ||
detail = std::move(detail), | ||
creation_task_exception_pb_bytes]() { | ||
Disconnect(exit_type, detail, creation_task_exception_pb_bytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, we should disconnect regardless of exit type so that the raylet will be informed about worker exit.
@@ -630,19 +630,19 @@ message MetricPoint { | |||
// Type of a worker exit. | |||
enum WorkerExitType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm in this case where does INTENTIONAL_SYSTEM_EXIT
belong to?
UNEXPECTED_SYSTEM_EXIT -> SYSTEM_EXIT
INTENTIONAL_USER_EXIT -> USER_EXIT
ACTOR_INIT_FAILURE_EXIT -> USER_ERROR
INTENTIONAL_SYSTEM_EXIT -> ? It doesn't seem to be SYSTEM_ERROR
// Type of this worker. | ||
WorkerType worker_type = 5; | ||
// This is for AddWorker. | ||
map<string, bytes> worker_info = 6; | ||
// The exception thrown in creation task. This field is set if this worker died because | ||
// of exception thrown in actor's creation task. Only applies when is_alive=false. | ||
RayException creation_task_exception = 18; | ||
// Whether it's an intentional disconnect, only applies then `is_alive` is false. | ||
optional WorkerExitType exit_type = 19; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm should we care this now? it requires huge refactoring, and I feel like backward compatibility problems will occur from other parts from these versions since we don't have proper design there yet.
…ct#24469) Add 30 seconds grace period before raising an exception from this test failure (https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_1FL4g3cMg1wYifWf52tAaWtJ?command-history-section=command_history). What I'd like to see is some sort of error messages are propagated to the driver if this is due to some unexpected issues. Note that this PR also adds more detailed exit information to all worker failures, but this is still WIP ray-project#24468
All comments are addressed! |
Doc test and other tests failiure seem unrelated |
Why are these changes needed?
I still need to add more tests, but the PR itself is ready to be reviewed
This PR adds precise reason details regarding worker failures. All information is available either by
Here's an example when the actor is killed by a SIGKILL (e.g., OOM killer)
Design
Worker failures are reported by Raylet from 2 paths.
(1) When the core worker calls
Disconnect
.(2) When the worker is unexpectedly killed, the socket is closed, raylet reports the worker failures.
The PR ensures all worker failures are reported through Disconnect while it includes more detailed information to its metadata.
Exit types
Previously, the worker exit types are arbitrary and not correctly categorized. This PR reduces the number of worker exit types while it includes details of each exit type so that users can easily figure out the root cause of worker crashes.
Status quo
New proposal
Remove unnecessary states and add “details” field. We can categorize failures by 4 types
Limitation (Follow up)
Worker failures are not reported under following circumstances
I will create issues to track these problems.
Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.