Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Clean up process leaks in worker processes, even if worker process crashes #34125

Closed
cadedaniel opened this issue Apr 6, 2023 · 3 comments
Assignees
Labels
core Issues that should be addressed in Ray Core core-correctness Leak, crash, hang P1 Issue that should be fixed within a few weeks Ray 2.5

Comments

@cadedaniel
Copy link
Member

In #33976 we fix certain worker child process leaks by listing them and killing them when the core worker shuts down. This is not a complete solution, as it will still leak processes if the worker process crashes.

We should have a better story around cleaning up leaked processes. One potentially solid implementation is to configure the raylet with PR_SET_CHILD_SUBREAPER, so that all orphaned descendant processes are re-parented to the raylet. The raylet would then periodically poll its immediate child processes and clean up any that are not known worker processes.

@cadedaniel cadedaniel added P1 Issue that should be fixed within a few weeks core Issues that should be addressed in Ray Core labels Apr 6, 2023
@cadedaniel cadedaniel self-assigned this Apr 6, 2023
@clarng
Copy link
Contributor

clarng commented Apr 6, 2023

this is important as the os or oom killer may kill the worker process, i.e. crash is part of 'normal' operation

@rkooo567
Copy link
Contributor

rkooo567 commented Apr 7, 2023

#26118 Duplicates

@rkooo567 rkooo567 added the core-correctness Leak, crash, hang label Apr 7, 2023
@cadedaniel
Copy link
Member Author

Closing in favor of #26118

rkooo567 pushed a commit that referenced this issue Aug 17, 2023
Summary
This PR has CoreWorkers kill their child processes when they notice the raylet has died. This fixes an edge case missing from #33976.

Details
The CoreWorker processes have a periodic check that validates the raylet process is alive. If the raylet is no longer alive, then the CoreWorker processes invoke QuickExit. This PR modifies this behavior so that before calling QuickExit, the CoreWorker processes kill all child processes.

Note that this does not address the case where CoreWorker processes crash or are killed with SIGKILL. That condition requires the subreaper work proposed in #34125 / #26118.

The ray stop --force case also requires the subreaper work. This is because --force kills CoreWorker processes with SIGKILL, leaving them no opportunity to clean up child processes. We can add the logic which catches leaked child processes, but the best design for this is the subreaper design.
arvind-chandra pushed a commit to lmco/ray that referenced this issue Aug 31, 2023
…ect#38439)

Summary
This PR has CoreWorkers kill their child processes when they notice the raylet has died. This fixes an edge case missing from ray-project#33976.

Details
The CoreWorker processes have a periodic check that validates the raylet process is alive. If the raylet is no longer alive, then the CoreWorker processes invoke QuickExit. This PR modifies this behavior so that before calling QuickExit, the CoreWorker processes kill all child processes.

Note that this does not address the case where CoreWorker processes crash or are killed with SIGKILL. That condition requires the subreaper work proposed in ray-project#34125 / ray-project#26118.

The ray stop --force case also requires the subreaper work. This is because --force kills CoreWorker processes with SIGKILL, leaving them no opportunity to clean up child processes. We can add the logic which catches leaked child processes, but the best design for this is the subreaper design.

Signed-off-by: e428265 <[email protected]>
vymao pushed a commit to vymao/ray that referenced this issue Oct 11, 2023
…ect#38439)

Summary
This PR has CoreWorkers kill their child processes when they notice the raylet has died. This fixes an edge case missing from ray-project#33976.

Details
The CoreWorker processes have a periodic check that validates the raylet process is alive. If the raylet is no longer alive, then the CoreWorker processes invoke QuickExit. This PR modifies this behavior so that before calling QuickExit, the CoreWorker processes kill all child processes.

Note that this does not address the case where CoreWorker processes crash or are killed with SIGKILL. That condition requires the subreaper work proposed in ray-project#34125 / ray-project#26118.

The ray stop --force case also requires the subreaper work. This is because --force kills CoreWorker processes with SIGKILL, leaving them no opportunity to clean up child processes. We can add the logic which catches leaked child processes, but the best design for this is the subreaper design.

Signed-off-by: Victor <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Issues that should be addressed in Ray Core core-correctness Leak, crash, hang P1 Issue that should be fixed within a few weeks Ray 2.5
Projects
None yet
Development

No branches or pull requests

4 participants