[Core] Add more accurate worker exit #24468

Merged: 21 commits merged into ray-project:master on May 20, 2022

@rkooo567 rkooo567 commented May 4, 2022

Why are these changes needed?

I still need to add more tests, but the PR itself is ready for review.

This PR adds precise reason details for worker failures. The information is available through either of the following:

  • ray list workers
  • exceptions raised from actor failures

Here's an example of the error raised when the actor is killed by SIGKILL (e.g., by the OOM killer):

RayActorError: The actor died unexpectedly before finishing this task.
	class_name: G
	actor_id: e818d2f0521a334daf03540701000000
	pid: 61251
	namespace: 674a49b2-5b9b-4fcc-b6e1-5a1d4b9400d2
	ip: 127.0.0.1
The actor is dead because its worker process has died. Worker exit type: UNEXPECTED_SYSTEM_EXIT Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
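
As a rough illustration of how the new fields could be consumed, here is a minimal sketch that pulls the exit type and exit detail out of the message text. The regular expression and the `parse_worker_exit` helper are hypothetical and assume the exact wording shown in the example above, which is not a stable format.

```python
import re
from typing import Optional, Tuple

# Assumed layout, copied from the example message above; not a stable format.
_EXIT_RE = re.compile(
    r"Worker exit type: (?P<type>\w+) Worker exit detail: (?P<detail>.*)",
    re.DOTALL,
)


def parse_worker_exit(message: str) -> Optional[Tuple[str, str]]:
    """Return (exit_type, exit_detail) if the message carries them, else None."""
    match = _EXIT_RE.search(message)
    if match is None:
        return None
    return match.group("type"), match.group("detail").strip()


example = (
    "The actor is dead because its worker process has died. "
    "Worker exit type: UNEXPECTED_SYSTEM_EXIT "
    "Worker exit detail: Worker unexpectedly exits with a connection "
    "error code 2. End of file."
)
parsed = parse_worker_exit(example)
```

Parsing text like this is brittle; it is shown only to make the two fields concrete.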

Design

The raylet reports worker failures through 2 paths:
(1) The core worker calls Disconnect.
(2) The worker is killed unexpectedly and its socket is closed, so the raylet detects and reports the failure.

This PR ensures that all worker failures are reported through Disconnect, and it adds more detailed information to the metadata.
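
The idea of funnelling both paths into one reporting routine can be sketched as follows. This is a toy model, not the raylet's actual code: `ToyRaylet`, `ExitReport`, and the handler names are hypothetical, and the exit type strings are simply borrowed from this description.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class ExitReport:
    worker_id: str
    exit_type: str
    detail: str


class ToyRaylet:
    """Hypothetical stand-in for the raylet; not Ray's real class."""

    def __init__(self) -> None:
        self.connected: Set[str] = set()
        self.reports: List[ExitReport] = []

    def register(self, worker_id: str) -> None:
        self.connected.add(worker_id)

    def disconnect(self, worker_id: str, exit_type: str, detail: str) -> None:
        """Single reporting path: records at most one report per worker."""
        if worker_id not in self.connected:
            return  # already reported, or never registered
        self.connected.remove(worker_id)
        self.reports.append(ExitReport(worker_id, exit_type, detail))

    def on_disconnect_message(self, worker_id: str, exit_type: str, detail: str) -> None:
        # Path 1: the core worker announces its own exit.
        self.disconnect(worker_id, exit_type, detail)

    def on_socket_closed(self, worker_id: str, error: str) -> None:
        # Path 2: the socket closed without an announcement.
        self.disconnect(
            worker_id,
            "UNEXPECTED_SYSTEM_EXIT",
            f"Worker unexpectedly exits with a connection error: {error}",
        )


raylet = ToyRaylet()
raylet.register("w1")
raylet.register("w2")
raylet.on_disconnect_message("w1", "INTENDED_USER_EXIT", "exit_actor was called")
raylet.on_socket_closed("w1", "End of file")  # ignored: w1 was already reported
raylet.on_socket_closed("w2", "End of file")
```

Routing both handlers through `disconnect` means every exit carries a type and a detail string, and a socket close that follows an explicit disconnect does not produce a second report.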

Exit types

Previously, worker exit types were arbitrary and not correctly categorized. This PR reduces the number of worker exit types and adds details to each exit type so that users can easily figure out the root cause of a worker crash.

Status quo

  • SYSTEM ERROR EXIT
    • Failure from the connection (core worker dead)
    • Unexpected exception or exit with exit_code !=0 on core worker
    • Direct call failure
  • INTENDED EXIT
    • Shutdown driver
    • Exit_actor
    • exit(0)
    • Actor kill request
    • Task cancel request
  • UNUSED_RESOURCE_REMOVED
    • Upon GCS restart, bundles that are not registered with GCS are killed to synchronize the state
  • PG_REMOVED
    • When a placement group (pg) is removed, all of its workers fate-share
  • CREATION_TASK (INIT ERROR)
    • When actor init has an error
  • IDLE
    • When the worker is idle and the number of workers exceeds the soft limit (by default, the number of CPUs)
  • NODE DIED
    • Can only be detected when the owner's node is dead (needs improvement)

New proposal

Remove unnecessary states and add a “details” field. Failures can be categorized into 4 types:

  • UNEXPECTED_SYSTEM_ERROR_EXIT
    • When the worker crashes unexpectedly
    • Failure from the connection (core worker dead)
    • Unexpected exception or exit with exit_code !=0 on core worker
    • Node died
    • Direct call failure
  • INTENDED_USER_EXIT
    • When users request that the worker be killed. No workflow required. Just correctly store the state.
    • Shutdown driver
    • Exit_actor
    • exit(0)
    • Actor kill request
    • Task cancel request
  • INTENDED_SYSTEM_EXIT
    • When the system requests that the worker be killed (without an explicit user request)
    • Unused resource removed
    • Pg removed
    • Idle
  • ACTOR_INIT_FAILURE (CREATION_TASK_FAILED)
    • When actor init fails, the process fate-shares with the actor.
    • Actor init failed
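
The regrouping described in the two lists above can be sketched as a small lookup table. The enum below mirrors the four proposed names as written in this description; the `OLD_TO_NEW` table and the `regroup` helper are hypothetical illustrations, and the real definition lives in the protobuf enum touched by this PR.

```python
from enum import Enum
from typing import Tuple


class WorkerExitType(Enum):
    """The four proposed categories, named as in the list above."""

    UNEXPECTED_SYSTEM_ERROR_EXIT = 1
    INTENDED_USER_EXIT = 2
    INTENDED_SYSTEM_EXIT = 3
    ACTOR_INIT_FAILURE = 4


# Status-quo category (as listed above) -> proposed category.
OLD_TO_NEW = {
    "SYSTEM ERROR EXIT": WorkerExitType.UNEXPECTED_SYSTEM_ERROR_EXIT,
    "NODE DIED": WorkerExitType.UNEXPECTED_SYSTEM_ERROR_EXIT,
    "INTENDED EXIT": WorkerExitType.INTENDED_USER_EXIT,
    "UNUSED_RESOURCE_REMOVED": WorkerExitType.INTENDED_SYSTEM_EXIT,
    "PG_REMOVED": WorkerExitType.INTENDED_SYSTEM_EXIT,
    "IDLE": WorkerExitType.INTENDED_SYSTEM_EXIT,
    "CREATION_TASK": WorkerExitType.ACTOR_INIT_FAILURE,
}


def regroup(old_type: str) -> Tuple[WorkerExitType, str]:
    """Map an old category to a new one, keeping the old name as the detail."""
    return OLD_TO_NEW[old_type], f"previously reported as {old_type}"


new_type, detail = regroup("PG_REMOVED")
```

Seven categories collapse into four, and the information that used to be carried by the category name moves into the detail string.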

Limitation (Follow up)

Worker failures are not reported under the following circumstances:

  • The worker fails before it registers its information with GCS (this usually comes from a critical system bug and is extremely uncommon).
  • The node fails. In this case, we should track a Node ID -> Worker ID mapping at GCS and record worker metadata when the node fails.

I will create issues to track these problems.
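
One possible shape for the Node ID -> Worker ID mapping mentioned above is sketched here. It is only an illustration of the follow-up idea: `NodeWorkerIndex` and its methods are hypothetical and do not correspond to existing GCS code.

```python
from collections import defaultdict
from typing import Dict, List, Set


class NodeWorkerIndex:
    """Hypothetical GCS-side index from node ID to the workers on that node."""

    def __init__(self) -> None:
        self._workers_by_node: Dict[str, Set[str]] = defaultdict(set)
        self.exit_details: Dict[str, str] = {}

    def add_worker(self, node_id: str, worker_id: str) -> None:
        self._workers_by_node[node_id].add(worker_id)

    def remove_worker(self, node_id: str, worker_id: str) -> None:
        self._workers_by_node[node_id].discard(worker_id)

    def on_node_failed(self, node_id: str) -> List[str]:
        """Record exit metadata for every worker that lived on the failed node."""
        workers = sorted(self._workers_by_node.pop(node_id, set()))
        for worker_id in workers:
            self.exit_details[worker_id] = f"Node {node_id} died"
        return workers


index = NodeWorkerIndex()
index.add_worker("node-a", "w1")
index.add_worker("node-a", "w2")
index.add_worker("node-b", "w3")
lost = index.on_node_failed("node-a")
```

With such an index, a node failure can be translated into per-worker exit records even though none of those workers had the chance to call Disconnect.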

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

rkooo567 added a commit that referenced this pull request May 4, 2022
Add a 30-second grace period before raising an exception from this test failure (https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_1FL4g3cMg1wYifWf52tAaWtJ?command-history-section=command_history). What I'd like to see is some sort of error message propagated to the driver if this is due to an unexpected issue.

Note that this PR also adds more detailed exit information to all worker failures, but that part is still WIP (#24468).
@rkooo567 rkooo567 changed the title [WIP] Add more accurate worker exit [Core] Add more accurate worker exit May 5, 2022
@WangTaoTheTonic
Contributor

Found many SANG-TODOs, is this ready for review?

@rkooo567
Contributor Author

rkooo567 commented May 6, 2022

@WangTaoTheTonic not yet. I am doing the final refinement & adding tests now

@@ -630,19 +630,19 @@ message MetricPoint {
// Type of a worker exit.
enum WorkerExitType {
Contributor

maybe we can simplify it to:
SYSTEM_EXIT,
USER_EXIT,
SYSTEM_ERROR,
USER_ERROR

Contributor Author

Hmm, in this case, where does INTENTIONAL_SYSTEM_EXIT belong?

UNEXPECTED_SYSTEM_EXIT -> SYSTEM_EXIT
INTENTIONAL_USER_EXIT -> USER_EXIT
ACTOR_INIT_FAILURE_EXIT -> USER_ERROR
INTENTIONAL_SYSTEM_EXIT -> ? It doesn't seem to be SYSTEM_ERROR

Contributor Author

EXIT == Intentional
ERROR == Unexpected

cc @jjyao when you read "Exit vs Error", did you feel it was clear?

Contributor Author

Actually, I did

INTENDED_SYSTEM_EXIT,
INTENDED_USER_EXIT,
SYSTEM_ERROR,
USER_ERROR

I think if I add INTENDED, it is clearer to understand the enum (since system exit usually includes unexpected exit iiuc)
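
To make the naming convention from this thread concrete, here is a small sketch. The pairing of earlier names with the final ones is inferred from the "EXIT == Intentional, ERROR == Unexpected" rule stated above and is not spelled out by the author; the `RENAMED` table and both helpers are hypothetical.

```python
# Names used earlier in the thread -> the four names the author settled on.
# The pairing is inferred from the EXIT/ERROR rule, not stated explicitly.
RENAMED = {
    "UNEXPECTED_SYSTEM_EXIT": "SYSTEM_ERROR",
    "INTENTIONAL_USER_EXIT": "INTENDED_USER_EXIT",
    "INTENTIONAL_SYSTEM_EXIT": "INTENDED_SYSTEM_EXIT",
    "ACTOR_INIT_FAILURE_EXIT": "USER_ERROR",
}


def is_intended(exit_type: str) -> bool:
    """EXIT == intentional: final names with the INTENDED_ prefix."""
    return exit_type.startswith("INTENDED_")


def is_unexpected(exit_type: str) -> bool:
    """ERROR == unexpected: final names ending in _ERROR."""
    return exit_type.endswith("_ERROR")


final_names = sorted(RENAMED.values())
```

Under this reading, each final name falls on exactly one side of the intended/unexpected split, which is the clarity the INTENDED prefix is meant to provide.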

// Type of this worker.
WorkerType worker_type = 5;
// This is for AddWorker.
map<string, bytes> worker_info = 6;
// The exception thrown in creation task. This field is set if this worker died because
// of exception thrown in actor's creation task. Only applies when is_alive=false.
RayException creation_task_exception = 18;
// Whether it's an intentional disconnect, only applies when `is_alive` is false.
optional WorkerExitType exit_type = 19;
Contributor

@scv119 scv119 May 6, 2022

Hmm, this might cause a backward compatibility issue. Can you use a different name?

Contributor Author

Hmm, should we care about this now? It requires a huge refactoring, and I feel like backward compatibility problems will come from other parts across these versions anyway, since we don't have a proper design there yet.

Contributor Author

Talked offline. This is okay because this message doesn't need to be backward compatible for the autoscaler. Also cc @wuisawesome

Contributor

yep i don't believe the autoscaler uses this message

@scv119 scv119 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 9, 2022
exit_type,
detail = std::move(detail),
creation_task_exception_pb_bytes]() {
Disconnect(exit_type, detail, creation_task_exception_pb_bytes);
Contributor Author

Ideally, we should disconnect regardless of exit type so that the raylet will be informed about worker exit.

rkooo567 added a commit to rkooo567/ray that referenced this pull request May 12, 2022
…ct#24469)

@rkooo567 rkooo567 mentioned this pull request May 12, 2022
6 tasks
@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 12, 2022
@rkooo567
Contributor Author

All comments are addressed!

@rkooo567
Contributor Author

@jjyao @scv119 can you guys take a look at this PR?

@rkooo567
Contributor Author

The doc test and other test failures seem unrelated.

@rkooo567 rkooo567 merged commit d89c8aa into ray-project:master May 20, 2022