[Core] Actors not cleaning up resources correctly because force_kill=true
#34124
Comments
But if the OS OOM killer kicks in, these handlers won't get called, right? We cannot guarantee that the OS OOM killer will never kick in.
Does our OOM killer immediately use SIGKILL? Or is there a SIGTERM first?
SIGKILL.
I see. For the process leak case, the raylet will handle it. Other resource leaks need more thought. One low-hanging fruit is to improve the OOM killer to send SIGTERM and then SIGKILL. I believe this should be doable with a bit of design.
Trying to clean up resources when the machine is under pressure may not be the best idea. What are we trying to clean up? Can that be done async?
@clarng this is only for the general shutdown cases. Right now, we always force-kill actors even when graceful shutdown is possible (which is bad). I think the OOM killer sending SIGKILL is reasonable; it just means we cannot properly trigger shutdown handlers when actors are OOM-killed (that's the point of SIGKILL, actually).
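To make the SIGTERM vs. SIGKILL distinction above concrete, here is a minimal sketch for a plain Python process (this is not Ray's worker code; install_graceful_shutdown is a hypothetical helper name). By default, Python does not run atexit handlers when the process is terminated by an unhandled signal, so the sketch installs a SIGTERM handler that exits through the normal interpreter shutdown path. SIGKILL cannot be caught, so nothing comparable is possible for it.

```python
import atexit
import signal
import sys


def install_graceful_shutdown(cleanup):
    """Hypothetical helper: run `cleanup` on normal exit and on SIGTERM.

    Must be called from the main thread, because signal.signal() only
    works there.
    """
    # atexit handlers run on normal interpreter shutdown, including sys.exit().
    atexit.register(cleanup)

    def _on_sigterm(signum, frame):
        # Raise SystemExit so the interpreter shuts down normally and the
        # registered atexit handlers get a chance to run.
        sys.exit(0)

    signal.signal(signal.SIGTERM, _on_sigterm)
    return _on_sigterm
```

With this in place, a process that receives SIGTERM runs its cleanup before exiting, while a process that receives SIGKILL (for example from an OOM killer) still exits without running it.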
Your case will be handled by #34125.
#32952 has a minimal repro script that we should test.
I believe this is also related: I kept running into this when using
Is there any way to disable this? Because it's really polluting the logs...
Hmm, I believe it is actually a different issue. I think when you execute cc @krfricke it is something
Discovered in the investigation of #31451 and #33976.
TL;DR: things like actor destructors or atexit handlers are not guaranteed to be executed when we destroy actors. This is because we use force_kill=true in gcs_actor_manager. Ideally, we should send SIGTERM to worker processes so that they can clean up any important state. After some time period, if the process has not died yet, we then send SIGKILL.
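The proposed SIGTERM-then-SIGKILL escalation can be sketched as follows. This is only an illustration using Python's subprocess module on POSIX, not the actual gcs_actor_manager or raylet logic; stop_process and grace_period_s are hypothetical names, and the 5-second default is an arbitrary assumption.

```python
import subprocess


def stop_process(proc: subprocess.Popen, grace_period_s: float = 5.0) -> str:
    """Hypothetical helper: try SIGTERM first, fall back to SIGKILL.

    Returns "graceful" if the process exited within the grace period
    (or had already exited), and "forced" if SIGKILL was needed.
    """
    if proc.poll() is not None:
        return "graceful"  # Process has already exited.

    proc.terminate()  # Sends SIGTERM on POSIX.
    try:
        proc.wait(timeout=grace_period_s)
        return "graceful"
    except subprocess.TimeoutExpired:
        # The process ignored or did not finish handling SIGTERM in time.
        proc.kill()  # Sends SIGKILL on POSIX; cannot be caught or ignored.
        proc.wait()  # Reap the process so it does not linger as a zombie.
        return "forced"
```

The grace period is the main design knob: too short and cleanup handlers get cut off, too long and shutdown stalls on a hung worker.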
More information in Sang's comment here #33976 (comment)