Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Actors not cleaning up resources correct because force_kill=true. #34124

Open
cadedaniel opened this issue Apr 6, 2023 · 10 comments
Open
Labels
bug Something that is supposed to be working; but isn't core Issues that should be addressed in Ray Core core-correctness Leak, crash, hang core-worker P2 Important issue, but not time-critical ray 2.10 size:large size:medium stability

Comments

@cadedaniel
Copy link
Member

Discovered in the investigation of #31451 and #33976.

TL;DR things like actor destructors or atexit handlers are not guaranteed to be executed when we destroy actors. This is because we use force_kill=true in gcs_actor_manager.

Ideally, we should send SIGTERM to worker processes so that they clean up any important state. After some time period, if the process has not already died already, we will then send a SIGKILL.

More information in Sang's comment here #33976 (comment)

@cadedaniel cadedaniel added P1 Issue that should be fixed within a few weeks core Issues that should be addressed in Ray Core labels Apr 6, 2023
@clarng
Copy link
Contributor

clarng commented Apr 6, 2023

but if os oom killer kick in these handlers won't get called right? we cannot guarantee os oom killer will never kick in

@jjyao jjyao added the Ray 2.5 label Apr 6, 2023
@cadedaniel
Copy link
Member Author

Does our OOM killer immediately use SIGKILL? Or is there a SIGTERM first?

@clarng
Copy link
Contributor

clarng commented Apr 6, 2023

sigkill
if we wait then it might trigger os oom (which will force kill anyway)

@cadedaniel
Copy link
Member Author

I see. For the process leak case, the raylet will handle this case. For other resource leaks, needs more thought. One low hanging fruit is to improve OOM killer to send SIGTERM then SIGKILL. I believe this should be doable with a bit of design..

@clarng
Copy link
Contributor

clarng commented Apr 6, 2023

trying to clean resources up when the machine is under pressure may not be the best idea

what are we trying to clean up? can that be done async?

@rkooo567
Copy link
Contributor

rkooo567 commented Apr 7, 2023

@clarng this is only for the general shutdown cases. Right now, we always force kill actors even when graceful shutdown is possible (which is bad).

I think OOM killer sending SIGKILL is reasonable, and this just means we cannot properly trigger shutdown handler (that's the point of SIKILL actually) when they are OOM kiled

@rkooo567
Copy link
Contributor

rkooo567 commented Apr 7, 2023

Your case will be handled by #34125

@rkooo567 rkooo567 added the bug Something that is supposed to be working; but isn't label Apr 8, 2023
@rkooo567
Copy link
Contributor

rkooo567 commented Apr 8, 2023

#32952 -> this issue has a minimal repro script we should test

@mlguruz
Copy link

mlguruz commented Apr 10, 2023

I believe this is also related:

I kept running into this when using xgboost_ray:

Exception ignored in: <function PlacementGroupResourceManager.__del__ at 0x7f6722c8a550>
File "/home/ubuntu/zstar/lib/python3.8/site-packages/ray/air/execution/resources/placement_group.py", line 211, in __del__
File "/home/ubuntu/zstar/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
File "/home/ubuntu/zstar/lib/python3.8/site-packages/ray/air/execution/resources/placement_group.py", line 193, in clear
File "/home/ubuntu/zstar/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 98, in wrapper
ImportError: sys.meta_path is None, Python is likely shutting down
Exception ignored in: <function PlacementGroupResourceManager.__del__ at 0x7efecd343550>

Is there any way to disable this? Coz its really polluting the logs...

@rkooo567
Copy link
Contributor

Hmm, I believe it is actually a different issue. I think when you execute __del__, any module call is not guaranteed to succeed because modules can be GC'ed before the class is destructed.

cc @krfricke it is something PlacementGroupResourceManager should fix in its layer.

@rkooo567 rkooo567 added the ray 2.9 Issues targeting Ray 2.9 release (~Q4 CY2023) label Oct 9, 2023
@rkooo567 rkooo567 added ray 2.10 and removed ray 2.9 Issues targeting Ray 2.9 release (~Q4 CY2023) labels Dec 12, 2023
@jjyao jjyao added P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks labels Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that is supposed to be working; but isn't core Issues that should be addressed in Ray Core core-correctness Leak, crash, hang core-worker P2 Important issue, but not time-critical ray 2.10 size:large size:medium stability
Projects
None yet
Development

No branches or pull requests

6 participants