[Core] Fix worker process leaks after job finishes #44214
Conversation
// If the worker is idle, we exit.
if (will_exit) {
  if (force_exit) {
Bug: when force_exit is set, we still go through the graceful exit code path, which means that if there are pending tasks or owned objects, the core worker will drain before exiting. That is not what we want when the job has finished.
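A minimal sketch of the intended control flow, reusing the ForceExit call quoted later in this thread; the Exit branch and its message are assumptions, not the verbatim change:

// Sketch only: a force exit must bypass the graceful path, which waits for
// pending tasks and owned objects to drain before exiting.
if (will_exit) {
  if (force_exit) {
    ForceExit(rpc::WorkerExitType::INTENDED_SYSTEM_EXIT,
              "Worker force exits because its job has finished");
  } else {
    Exit(rpc::WorkerExitType::INTENDED_SYSTEM_EXIT,
         "Worker exits because it is idle.");
  }
}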
bool own_objects = reference_counter_->OwnObjects();
int64_t pins_in_flight = local_raylet_client_->GetPinsInFlight();
const bool own_objects = reference_counter_->OwnObjects();
const size_t num_pending_tasks = task_manager_->NumPendingTasks();
Bug: we didn't consider pending tasks when deciding whether the core worker is idle, even though we do consider them during drain and exit.
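A rough sketch of the corrected idle check; the is_idle composition at the end is an assumption about how these values are combined, not the exact code in the PR:

// Sketch: a worker only counts as idle when it owns no objects, has no
// object pins in flight, and (the fix) has no tasks still pending.
const bool own_objects = reference_counter_->OwnObjects();
const int64_t pins_in_flight = local_raylet_client_->GetPinsInFlight();
const size_t num_pending_tasks = task_manager_->NumPendingTasks();
const bool is_idle = !own_objects && pins_in_flight == 0 && num_pending_tasks == 0;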
Can we write a python test for this?
if (worker && finished_jobs_.contains(task_spec.JobId()) &&
    task_spec.AncestorDetachedActorId().IsNil()) {
  RAY_CHECK(status == PopWorkerStatus::OK);
  callback(nullptr, PopWorkerStatus::JobFinished, "");
Bug: when a job finishes, the raylet kills leased workers (once) and idle workers (periodically). However, some workers belong to neither state: workers inside PopWorkerCallbackInternal that are sitting in the event loop, waiting to be added to the leased workers later. This fix makes sure such a worker goes back to idle and is then killed by the periodic idle termination.
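An annotated restatement of the diff quoted above (a sketch, not the verbatim change), showing where the early-out sits in PopWorkerCallbackInternal:

// Sketch: if the job already finished (and the task isn't rooted in a
// detached actor), don't lease the popped worker to the task. Reporting
// JobFinished lets the worker return to the idle pool, where the periodic
// idle-worker termination will eventually kill it.
if (worker && finished_jobs_.contains(task_spec.JobId()) &&
    task_spec.AncestorDetachedActorId().IsNil()) {
  RAY_CHECK(status == PopWorkerStatus::OK);
  callback(nullptr, PopWorkerStatus::JobFinished, "");
}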
@@ -597,6 +597,25 @@ void NodeManager::HandleJobStarted(const JobID &job_id, const JobTableData &job_
void NodeManager::HandleJobFinished(const JobID &job_id, const JobTableData &job_data) {
  RAY_LOG(DEBUG) << "HandleJobFinished " << job_id;
  RAY_CHECK(job_data.is_dead());
  for (const auto &pair : leased_workers_) {
Bug: we only tried to kill worker processes when their parent dies. In theory this works if you treat the task graph as a tree rooted at the driver, so when the root dies the entire task graph eventually dies; in practice, worker reuse introduces cycles, so the driver's death doesn't transitively kill all workers. This PR fixes that by also killing workers when the job finishes.
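A hedged sketch of the job-finished handling described above; GetAssignedJobId and GetRootDetachedActorId come from fragments elsewhere in this review, while DestroyWorker and the collect-then-kill structure are assumptions about how the workers are actually killed:

// Sketch: when a job dies, force-exit every worker currently leased to it,
// unless the worker is rooted in a detached actor (those outlive the job).
void NodeManager::HandleJobFinished(const JobID &job_id, const JobTableData &job_data) {
  RAY_LOG(DEBUG) << "HandleJobFinished " << job_id;
  RAY_CHECK(job_data.is_dead());
  // Collect first so that killing workers can't invalidate the iterator.
  std::vector<std::shared_ptr<WorkerInterface>> workers_to_kill;
  for (const auto &pair : leased_workers_) {
    const auto &worker = pair.second;
    if (worker->GetAssignedJobId() == job_id &&
        worker->GetRootDetachedActorId().IsNil()) {
      workers_to_kill.push_back(worker);
    }
  }
  for (const auto &worker : workers_to_kill) {
    DestroyWorker(worker,
                  rpc::WorkerExitType::INTENDED_SYSTEM_EXIT,
                  "Worker force exits because its job has finished");
  }
}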
Add test?
@@ -338,20 +347,17 @@ WorkerPool::BuildProcessCommandArgs(const Language &language,
  worker_command_args.push_back("--worker-launch-time-ms=" +
                                std::to_string(current_sys_time_ms()));
  worker_command_args.push_back("--node-id=" + node_id_.Hex());
  worker_command_args.push_back("--runtime-env-hash=" +
Bug: runtime-env-hash is actually the worker cache key and includes more than just the runtime env, so we should always set it when starting a worker process, not only when the runtime env is non-empty.
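For context, a short sketch of what always setting it looks like in BuildProcessCommandArgs; the runtime_env_hash variable name is an assumption about the surrounding code:

// Sketch: pass the worker cache key unconditionally, even when the runtime
// env is empty, so a started worker can always be matched back to a lease.
worker_command_args.push_back("--runtime-env-hash=" +
                              std::to_string(runtime_env_hash));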
Hmm the number of bugs here is a bit scary :)
Can we add unit tests for these?
I think the fix makes sense. We need tests and comments to explain some of the code, as we discussed offline.
@@ -205,6 +208,8 @@ class Worker : public WorkerInterface {
    lifetime_allocated_instances_ = allocated_instances;
  };

  const ActorID &GetRootDetachedActorId() const { return root_detached_actor_id_; }
what does root detached actor mean?
If a task or actor is created by a detached actor (directly or transitively), then its root is that detached actor; otherwise its root is the driver. For example, if detached actor A spawns task t, then t's root detached actor is A, whereas a task submitted by the driver has a nil root detached actor id.
// The task job finished.
// Just remove the task from dispatch queue.
RAY_LOG(DEBUG) << "Call back to a job finished task, task id = " << task_id;
erase_from_dispatch_queue_fn(work, scheduling_class);
Is this available in the worker death reason, btw?
We have:
if (force_exit) {
  ForceExit(rpc::WorkerExitType::INTENDED_SYSTEM_EXIT,
            "Worker force exits because its job has finished");
// If this task originated from a detached actor,
// this field contains the detached actor id.
// Otherwise it's empty and the task originated from a driver.
bytes root_detached_actor_id = 40;
Why don't we just call it from_detached_actor: bool?
This way we can differentiate worker processes belonging to different detached actors.
Why are these changes needed?
This PR makes sure that when a job finishes, all worker processes belonging to it (excluding those started by detached actors) forcibly exit. It fixes this by:
This PR doesn't fix the worker leak when detached actors finish. Ideally, if we treated detached actors as separate jobs, the same code would just work, but that change has a bigger impact and higher risk (e.g. would we overload the job table, since each Serve replica is a detached actor; would we run out of job ids, which are 4 bytes), so I didn't include it in this PR.
Related issue number
Closes #44897, #44931