-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Clean up process leaks in worker processes, even if worker process crashes #34125
Labels
core
Issues that should be addressed in Ray Core
core-correctness
Leak, crash, hang
P1
Issue that should be fixed within a few weeks
Ray 2.5
Comments
cadedaniel
added
P1
Issue that should be fixed within a few weeks
core
Issues that should be addressed in Ray Core
labels
Apr 6, 2023
2 tasks
this is important as the os or oom killer may kill the worker process, i.e. crash is part of 'normal' operation |
#26118 Duplicates |
Closing in favor of #26118 |
rkooo567
pushed a commit
that referenced
this issue
Aug 17, 2023
Summary This PR has CoreWorkers kill their child processes when they notice the raylet has died. This fixes an edge case missing from #33976. Details The CoreWorker processes have a periodic check that validates the raylet process is alive. If the raylet is no longer alive, then the CoreWorker processes invoke QuickExit. This PR modifies this behavior so that before calling QuickExit, the CoreWorker processes kill all child processes. Note that this does not address the case where CoreWorker processes crash or are killed with SIGKILL. That condition requires the subreaper work proposed in #34125 / #26118. The ray stop --force case also requires the subreaper work. This is because --force kills CoreWorker processes with SIGKILL, leaving them no opportunity to clean up child processes. We can add the logic which catches leaked child processes, but the best design for this is the subreaper design.
arvind-chandra
pushed a commit
to lmco/ray
that referenced
this issue
Aug 31, 2023
…ect#38439) Summary This PR has CoreWorkers kill their child processes when they notice the raylet has died. This fixes an edge case missing from ray-project#33976. Details The CoreWorker processes have a periodic check that validates the raylet process is alive. If the raylet is no longer alive, then the CoreWorker processes invoke QuickExit. This PR modifies this behavior so that before calling QuickExit, the CoreWorker processes kill all child processes. Note that this does not address the case where CoreWorker processes crash or are killed with SIGKILL. That condition requires the subreaper work proposed in ray-project#34125 / ray-project#26118. The ray stop --force case also requires the subreaper work. This is because --force kills CoreWorker processes with SIGKILL, leaving them no opportunity to clean up child processes. We can add the logic which catches leaked child processes, but the best design for this is the subreaper design. Signed-off-by: e428265 <[email protected]>
vymao
pushed a commit
to vymao/ray
that referenced
this issue
Oct 11, 2023
…ect#38439) Summary This PR has CoreWorkers kill their child processes when they notice the raylet has died. This fixes an edge case missing from ray-project#33976. Details The CoreWorker processes have a periodic check that validates the raylet process is alive. If the raylet is no longer alive, then the CoreWorker processes invoke QuickExit. This PR modifies this behavior so that before calling QuickExit, the CoreWorker processes kill all child processes. Note that this does not address the case where CoreWorker processes crash or are killed with SIGKILL. That condition requires the subreaper work proposed in ray-project#34125 / ray-project#26118. The ray stop --force case also requires the subreaper work. This is because --force kills CoreWorker processes with SIGKILL, leaving them no opportunity to clean up child processes. We can add the logic which catches leaked child processes, but the best design for this is the subreaper design. Signed-off-by: Victor <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
core
Issues that should be addressed in Ray Core
core-correctness
Leak, crash, hang
P1
Issue that should be fixed within a few weeks
Ray 2.5
In #33976 we fix certain worker child process leaks by listing them and killing them when the core worker shuts down. This is not a complete solution, as it will still leak processes if the worker process crashes.
We should have a better story around cleaning up leaked processes. One potentially solid implementation is to configure the raylet with
PR_SET_CHILD_SUBREAPER
, so that all orphaned descendant processes are re-parented to the raylet. The raylet would then periodically poll its immediate child processes and clean up any that are not known worker processes.The text was updated successfully, but these errors were encountered: