Skip to content

Commit

Permalink
[core] Deflakey test advanced 9 (ray-project#35247)
Browse files Browse the repository at this point in the history
Previously a bug was fixed in ray-project#33311 where pubsub causes the leak. Somehow the fix has race conditions and got triggered later when code changes.

The test is flakey because there is a race condition between raylet sending node failure and core worker exit itself.

When disconnect is sent to Raylet, Raylet will start to report worker failure. But the worker still continue to run.

GCS uses worker failure to close the connection. But if the worker is still alive, the worker might send another request the GCS which will lead to the FD leak.

Compare with ray-project#34883 it's a short term fix and the goal is to make the case the same as 2.3.

Signed-off-by: e428265 <[email protected]>
  • Loading branch information
fishbone authored and arvind-chandra committed Aug 31, 2023
1 parent ac77b10 commit f5563b8
Showing 1 changed file with 10 additions and 2 deletions.
12 changes: 10 additions & 2 deletions src/ray/core_worker/core_worker.cc
Original file line number Diff line number Diff line change
Expand Up @@ -790,8 +790,12 @@ void CoreWorker::Exit(
detail = std::move(detail),
creation_task_exception_pb_bytes]() {
rpc::DrainServerCallExecutor();
Disconnect(exit_type, detail, creation_task_exception_pb_bytes);
KillChildProcs();
// Disconnect should be put close to Shutdown
// https://github.com/ray-project/ray/pull/34883
// TODO (iycheng) Improve the Process.h and make it able to monitor
// process liveness
Disconnect(exit_type, detail, creation_task_exception_pb_bytes);
Shutdown();
},
"CoreWorker.Shutdown");
Expand Down Expand Up @@ -835,9 +839,13 @@ void CoreWorker::ForceExit(const rpc::WorkerExitType exit_type,
const std::string &detail) {
RAY_LOG(WARNING) << "Force exit the process. "
<< " Details: " << detail;
Disconnect(exit_type, detail);

KillChildProcs();
// Disconnect should be put close to Exit
// https://github.com/ray-project/ray/pull/34883
// TODO (iycheng) Improve the Process.h and make it able to monitor
// process liveness
Disconnect(exit_type, detail);

// NOTE(hchen): Use `QuickExit()` to force-exit this process without doing cleanup.
// `exit()` will destruct static objects in an incorrect order, which will lead to
Expand Down

0 comments on commit f5563b8

Please sign in to comment.