Skip to content

Commit

Permalink
[core] Deflakey test advanced 9 (ray-project#35247)
Browse files Browse the repository at this point in the history
Previously a bug was fixed in ray-project#33311 where pubsub causes the leak. Somehow the fix has race conditions and got triggered later when code changes.

The test is flakey because there is a race condition between raylet sending node failure and core worker exit itself.

When disconnect is sent to Raylet, Raylet will start to report worker failure. But the worker still continue to run.

GCS uses worker failure to close the connection. But if the worker is still alive, the worker might send another request the GCS which will lead to the FD leak.

Compare with ray-project#34883 it's a short term fix and the goal is to make the case the same as 2.3.
  • Loading branch information
fishbone authored and architkulkarni committed May 16, 2023
1 parent 0666204 commit 9316ce7
Showing 1 changed file with 10 additions and 2 deletions.
12 changes: 10 additions & 2 deletions src/ray/core_worker/core_worker.cc
Original file line number Diff line number Diff line change
Expand Up @@ -790,8 +790,12 @@ void CoreWorker::Exit(
detail = std::move(detail),
creation_task_exception_pb_bytes]() {
rpc::DrainServerCallExecutor();
Disconnect(exit_type, detail, creation_task_exception_pb_bytes);
KillChildProcs();
// Disconnect should be put close to Shutdown
// https://github.com/ray-project/ray/pull/34883
// TODO (iycheng) Improve the Process.h and make it able to monitor
// process liveness
Disconnect(exit_type, detail, creation_task_exception_pb_bytes);
Shutdown();
},
"CoreWorker.Shutdown");
Expand Down Expand Up @@ -835,9 +839,13 @@ void CoreWorker::ForceExit(const rpc::WorkerExitType exit_type,
const std::string &detail) {
RAY_LOG(WARNING) << "Force exit the process. "
<< " Details: " << detail;
Disconnect(exit_type, detail);

KillChildProcs();
// Disconnect should be put close to Exit
// https://github.com/ray-project/ray/pull/34883
// TODO (iycheng) Improve the Process.h and make it able to monitor
// process liveness
Disconnect(exit_type, detail);

// NOTE(hchen): Use `QuickExit()` to force-exit this process without doing cleanup.
// `exit()` will destruct static objects in an incorrect order, which will lead to
Expand Down

0 comments on commit 9316ce7

Please sign in to comment.