[Core] Add more accurate worker exit #24468

rkooo567 · 2022-05-04T16:41:47Z

Why are these changes needed?

I still need to add more tests, but the PR itself is ready to be reviewed.

This PR adds precise details about the reason for each worker failure. All of the information is available through either of the following:

- ray list workers
- exceptions raised from actor failures

Here's an example of the exception raised when the actor is killed by SIGKILL (e.g., by the OOM killer):

RayActorError: The actor died unexpectedly before finishing this task.
	class_name: G
	actor_id: e818d2f0521a334daf03540701000000
	pid: 61251
	namespace: 674a49b2-5b9b-4fcc-b6e1-5a1d4b9400d2
	ip: 127.0.0.1
The actor is dead because its worker process has died. Worker exit type: UNEXPECTED_SYSTEM_EXIT Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
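
To pull the exit type and detail out of a message like the one above, a minimal Python sketch follows. It assumes the `Worker exit type: ... Worker exit detail: ...` layout shown in this example, which is not a stable API. `parse_worker_exit` is a hypothetical helper and not part of Ray.

```python
import re
from typing import Optional, Tuple

# Assumed layout, copied from the example message above; not a stable Ray API.
_EXIT_RE = re.compile(
    r"Worker exit type: (?P<type>[A-Z_]+)\s+Worker exit detail: (?P<detail>.*)",
    re.DOTALL,
)


def parse_worker_exit(message: str) -> Optional[Tuple[str, str]]:
    """Return (exit_type, detail) if the message has the assumed layout."""
    match = _EXIT_RE.search(message)
    if match is None:
        return None
    return match.group("type"), match.group("detail").strip()


example = (
    "The actor is dead because its worker process has died. "
    "Worker exit type: UNEXPECTED_SYSTEM_EXIT "
    "Worker exit detail: Worker unexpectedly exits with a connection "
    "error code 2. End of file."
)
print(parse_worker_exit(example))
```

Matching on message text is brittle, so this is only useful for ad hoc log inspection.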

Design

Raylet reports worker failures through 2 paths:
(1) The core worker calls Disconnect.
(2) The worker is killed unexpectedly, its socket is closed, and the raylet reports the worker failure.

This PR ensures that all worker failures are reported through Disconnect, and it adds more detailed information to the failure metadata.
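
The control flow can be sketched in Python as follows: both the explicit Disconnect call and the socket-closed path funnel into one reporting function that carries an exit type and a detail string. The names `RayletSketch`, `WorkerExit`, and `on_socket_closed` are hypothetical and only illustrate the idea; the real logic lives in the C++ raylet and core worker.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class WorkerExit:
    exit_type: str
    detail: str


class RayletSketch:
    """Toy model: every worker failure is recorded through disconnect()."""

    def __init__(self) -> None:
        self.reported: Dict[str, WorkerExit] = {}

    def disconnect(self, worker_id: str, exit_type: str, detail: str) -> None:
        # Single reporting path. Keeping only the first report per worker
        # is an assumption of this sketch, not documented behavior.
        self.reported.setdefault(worker_id, WorkerExit(exit_type, detail))

    def on_socket_closed(self, worker_id: str, error: str) -> None:
        # Path (2): the worker died without calling Disconnect, so the
        # raylet reports on its behalf through the same function.
        self.disconnect(
            worker_id,
            "UNEXPECTED_SYSTEM_EXIT",
            f"Worker unexpectedly exits with a connection error: {error}",
        )


raylet = RayletSketch()
raylet.disconnect("w1", "INTENDED_USER_EXIT", "exit_actor was called")
raylet.on_socket_closed("w2", "End of file")
```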

Exit types

Previously, the worker exit types were arbitrary and not correctly categorized. This PR reduces the number of worker exit types and attaches details to each exit so that users can easily figure out the root cause of a worker crash.

Status quo

SYSTEM ERROR EXIT
- Failure from the connection (core worker dead)
- Unexpected exception or exit with exit_code !=0 on core worker
- Direct call failure
INTENDED EXIT
- Shutdown driver
- Exit_actor
- exit(0)
- Actor kill request
- Task cancel request
UNUSED_RESOURCE_REMOVED
- Upon GCS restart, bundles that are not registered with GCS are killed to synchronize the state
PG_REMOVED
- When a placement group (pg) is removed, all of its workers fate-share
CREATION_TASK (INIT ERROR)
- When actor init has an error
IDLE
- When a worker is idle and the number of workers exceeds the soft limit (by default, the number of CPUs)
NODE DIED
- Can only be detected when the owner's node is dead (needs improvement)

New proposal

Remove unnecessary states and add a “details” field. Failures can then be categorized into 4 types:

UNEXPECTED_SYSTEM_ERROR_EXIT
- When the worker crashes unexpectedly
- Failure from the connection (core worker dead)
- Unexpected exception or exit with exit_code !=0 on core worker
- Node died
- Direct call failure
INTENDED_USER_EXIT
- When users request that the worker be killed. No workflow is required; just store the state correctly.
- Shutdown driver
- Exit_actor
- exit(0)
- Actor kill request
- Task cancel request
INTENDED_SYSTEM_EXIT
- When the system requests that the worker be killed (without an explicit user request)
- Unused resource removed
- Pg removed
- Idle
ACTOR_INIT_FAILURE (CREATION_TASK_FAILED)
- When actor init fails, the worker process fate-shares with the actor.
- Actor init failed
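
The regrouping described in this proposal can be written down as a small lookup table. The Python sketch below is illustrative only: the enum members reuse the names from the list above, the keys are the status-quo category labels from this description, and `ProposedExitType` and `STATUS_QUO_TO_PROPOSED` are hypothetical names. The actual enum is defined in `src/ray/protobuf/common.proto`, and its final names may differ.

```python
from enum import Enum


class ProposedExitType(Enum):
    UNEXPECTED_SYSTEM_ERROR_EXIT = "UNEXPECTED_SYSTEM_ERROR_EXIT"
    INTENDED_USER_EXIT = "INTENDED_USER_EXIT"
    INTENDED_SYSTEM_EXIT = "INTENDED_SYSTEM_EXIT"
    ACTOR_INIT_FAILURE = "ACTOR_INIT_FAILURE"


# Status-quo category -> proposed category, following the lists above.
STATUS_QUO_TO_PROPOSED = {
    "SYSTEM ERROR EXIT": ProposedExitType.UNEXPECTED_SYSTEM_ERROR_EXIT,
    "NODE DIED": ProposedExitType.UNEXPECTED_SYSTEM_ERROR_EXIT,
    "INTENDED EXIT": ProposedExitType.INTENDED_USER_EXIT,
    "UNUSED_RESOURCE_REMOVED": ProposedExitType.INTENDED_SYSTEM_EXIT,
    "PG_REMOVED": ProposedExitType.INTENDED_SYSTEM_EXIT,
    "IDLE": ProposedExitType.INTENDED_SYSTEM_EXIT,
    "CREATION_TASK (INIT ERROR)": ProposedExitType.ACTOR_INIT_FAILURE,
}

for old, new in STATUS_QUO_TO_PROPOSED.items():
    print(f"{old} -> {new.name}")
```

Seven status-quo categories collapse into four, and the "details" field carries whatever distinction the old category name used to encode.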

Limitation (Follow up)

Worker failures are not reported under the following circumstances:

- The worker fails before it registers its information with GCS (this usually comes from a critical system bug and is extremely uncommon).
- The node fails. In this case, we should track a Node ID -> Worker ID mapping at GCS and record the worker metadata when the node fails.

I will create issues to track these problems.
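
One possible shape for the Node ID -> Worker ID follow-up is sketched below in Python. This is not implemented in this PR. `GcsWorkerIndexSketch` and its methods are hypothetical names, and the choice of exit type follows the proposal above, where a node death is grouped under UNEXPECTED_SYSTEM_ERROR_EXIT.

```python
from collections import defaultdict
from typing import Dict, Set


class GcsWorkerIndexSketch:
    """Toy index: remembers which workers run on which node."""

    def __init__(self) -> None:
        self._workers_by_node: Dict[str, Set[str]] = defaultdict(set)
        self.exit_records: Dict[str, Dict[str, str]] = {}

    def register_worker(self, node_id: str, worker_id: str) -> None:
        self._workers_by_node[node_id].add(worker_id)

    def on_node_failed(self, node_id: str) -> None:
        # Record exit metadata for every worker known to be on the node.
        for worker_id in sorted(self._workers_by_node.pop(node_id, set())):
            self.exit_records[worker_id] = {
                "exit_type": "UNEXPECTED_SYSTEM_ERROR_EXIT",
                "detail": f"Node {node_id} died.",
            }


index = GcsWorkerIndexSketch()
index.register_worker("node-a", "w1")
index.register_worker("node-a", "w2")
index.register_worker("node-b", "w3")
index.on_node_failed("node-a")
```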

Related issue number

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Add 30 seconds grace period before raising an exception from this test failure (https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_1FL4g3cMg1wYifWf52tAaWtJ?command-history-section=command_history). What I'd like to see is some sort of error messages are propagated to the driver if this is due to some unexpected issues. Note that this PR also adds more detailed exit information to all worker failures, but this is still WIP #24468

WangTaoTheTonic · 2022-05-06T03:54:18Z

Found many SANG-TODOs, is this ready for review?

rkooo567 · 2022-05-06T05:44:45Z

@WangTaoTheTonic not yet. I am doing the final refinement & adding tests now

scv119 · 2022-05-06T19:19:54Z

src/ray/protobuf/common.proto

@@ -630,19 +630,19 @@ message MetricPoint {
 // Type of a worker exit.
 enum WorkerExitType {


maybe we can simplify it to:
SYSTEM_EXIT,
USER_EXIT,
SYSTEM_ERROR,
USER_ERROR

rkooo567 · 2022-05-10T16:16:03Z

Hmm, in this case where does INTENTIONAL_SYSTEM_EXIT belong?

UNEXPECTED_SYSTEM_EXIT -> SYSTEM_EXIT
INTENTIONAL_USER_EXIT -> USER_EXIT
ACTOR_INIT_FAILURE_EXIT -> USER_ERROR
INTENTIONAL_SYSTEM_EXIT -> ? It doesn't seem to be SYSTEM_ERROR

EXIT == Intentional
ERROR == Unexpected

cc @jjyao when you read "Exit vs Error", did you feel it was clear?

Actually, I did

INTENDED_SYSTEM_EXIT,
INTENDED_USER_EXIT,
SYSTEM_ERROR,
USER_ERROR

I think adding INTENDED makes the enum clearer to understand (since "system exit" usually includes unexpected exits, iiuc)

scv119 · 2022-05-06T19:26:00Z

src/ray/protobuf/gcs.proto

  // Type of this worker.
  WorkerType worker_type = 5;
  // This is for AddWorker.
  map<string, bytes> worker_info = 6;
  // The exception thrown in creation task. This field is set if this worker died because
  // of exception thrown in actor's creation task. Only applies when is_alive=false.
  RayException creation_task_exception = 18;
+  // Whether it's an intentional disconnect, only applies then `is_alive` is false.
+  optional WorkerExitType exit_type = 19;


hmm, this might cause a backward compatibility issue. Can you use a different name?

rkooo567 · 2022-05-10T16:17:20Z

Hmm, should we care about this now? It requires a huge refactoring, and I feel backward compatibility problems will come from other parts across these versions anyway, since we don't have a proper design there yet.

Talked offline. This is okay because this is not a message whose backward compatibility matters for the autoscaler. Also cc @wuisawesome

wuisawesome · 2022-05-12

yep i don't believe the autoscaler uses this message

rkooo567 · 2022-05-06T07:26:12Z

src/ray/core_worker/core_worker.cc

+             exit_type,
+             detail = std::move(detail),
+             creation_task_exception_pb_bytes]() {
+              Disconnect(exit_type, detail, creation_task_exception_pb_bytes);


Ideally, we should disconnect regardless of exit type so that the raylet will be informed about worker exit.

rkooo567 · 2022-05-16T13:31:57Z

All comments are addressed!

rkooo567 · 2022-05-17T14:18:07Z

@jjyao @scv119 can you guys take a look at this PR?

rkooo567 · 2022-05-20T02:48:23Z

The doc test and other test failures seem unrelated.

rkooo567 added 2 commits May 3, 2022 18:20

In progress

2a25d5f

Still in progress

2d6708f

rkooo567 requested review from wuisawesome, ericl, AmeerHajAli, robertnishihara, pcmoritz, raulchen, fishbone, scv119 and mwtian as code owners May 4, 2022 16:41

rkooo567 mentioned this pull request May 4, 2022

[Test] Add grace period to long running actor test failure #24469

Merged

rkooo567 added 2 commits May 4, 2022 19:53

Merge branch 'master' into add-more-accurate-worker-exit

5250784

Make it work.

1d6e193

rkooo567 changed the title ~~[WIP] Add more accurate worker exit~~ [Core] Add more accurate worker exit May 5, 2022

rkooo567 added 2 commits May 5, 2022 14:13

Fix java compilation issue

1f8f8ba

Remove an unnecessary file

4361149

rkooo567 added 3 commits May 6, 2022 02:21

Done except tests.

ce6131a

Fix test failures.

9af9d6d

lint

20323fd

rkooo567 assigned scv119, jjyao and WangTaoTheTonic May 6, 2022

scv119 reviewed May 6, 2022

scv119 reviewed May 9, 2022

scv119 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 9, 2022

rkooo567 commented May 10, 2022

rkooo567 mentioned this pull request May 12, 2022

[DO NOT MERGE] #24716

Closed

rkooo567 added 3 commits May 12, 2022 04:56

Merge branch 'master' into add-more-accurate-worker-exit

435830e

Addressed the initial code review.

42f4bb1

Address another code review.

17469ed

rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 12, 2022

rkooo567 added 2 commits May 17, 2022 18:57

Merge branch 'master' into add-more-accurate-worker-exit

f62e9d1

in progress

90ea33e

scv119 approved these changes May 18, 2022

rkooo567 added 7 commits May 18, 2022 17:08

Add tests.

18d9150

Done

f3d3236

Merge branch 'master' into add-more-accurate-worker-exit

298bcc2

Finish lint

412adbc

in progress

3a2d1f6

Merge branch 'master' into add-more-accurate-worker-exit

f8c1f6e

Fix a bug.

8f26cda

rkooo567 mentioned this pull request May 20, 2022

[flakey] Deflakey test_gcs_fault_tollerance.py #24981

Merged

rkooo567 merged commit d89c8aa into ray-project:master May 20, 2022
