[core] Multi-process(?) / GPU processes do not seem to be freed after ctrl+c on cluster #31451
Comments
@thoglu, can you confirm the following exhibits the same leak for you?
Hmm, no, this one does not have the leak. I will try a few things out tomorrow and work my way toward my own situation (it's too late here right now).
@thoglu could you share what your trainable definition looks like?
Hmm, so it has to do with the trainable. I checked with my submission script and just replaced my trainable with the simple one from above, and there is no problem. My actual trainable involves a pytorch-lightning training loop and many lines of code split across different files. Anything in particular I should be looking for? Could this be related to a garbage-collection issue with some tensor that was not detached and a reference that was kept? I would imagine none of that should matter once you ctrl+c? Or pytorch-lightning?
Do you have any multiprocessing in your script, e.g. a DataLoader with `num_workers > 0`? One way to verify this is to run with `num_workers=0`. Related: ray-project/ray_lightning#87
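For reference, a minimal sketch of what such a check could look like with a standard `torch.utils.data.DataLoader`; the dataset and worker counts here are placeholders, not taken from the original report:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the real training data.
dataset = TensorDataset(torch.randn(1024, 8), torch.randn(1024, 1))

# num_workers > 0 makes the DataLoader fork worker subprocesses,
# which are the suspected source of the leaked processes.
leaky_loader = DataLoader(dataset, batch_size=32, num_workers=4)

# To verify, rerun the same job with num_workers=0 so that all data
# loading happens in the main process and no workers are forked.
safe_loader = DataLoader(dataset, batch_size=32, num_workers=0)
```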
@matthewdeng @ericl So it is the same issue as ray-project/ray_lightning#87. Is there any hope that this will get solved at all? The issue has been open for over a year already. It is not even a Lightning issue, but a "DataLoader in connection with Ray" issue, right? There must be other people seeing this already, I presume.
Yeah, I think there is an underlying process management bug here when those workers are forked. I'll keep the P1 tag. @scv119, is this something we can slot for 2.3-2.4?
@ericl @matthewdeng EDIT: It actually did not solve the issue; I just ran the job for too short a time. After starting the dataloader with …
Yeah, I think Ray should in principle be able to kill the process successfully even if `num_workers>0` or there is an NVIDIA issue. Maybe there's an issue with the worker's graceful shutdown and we need to force kill it.
Hi @thoglu, could you run a …
@cadedaniel the processes are in …
This can be reproduced without GPUs, though in practice I think this is more noticeable when using GPUs because the GRAM is held. Minimal repro:

```python
import ray
import time
from multiprocessing import Process

@ray.remote
class MyActor:
    def run(self):
        p = Process(target=lambda: time.sleep(1000), daemon=True)
        p.start()
        p.join()

actor = MyActor.remote()
ray.get(actor.run.remote())
```

When executing the script: …
After terminating the script with ctrl+c, the spawned process remains: …
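One way to check for the surviving process after the driver has been killed (a sketch using psutil; the process-name filter is an assumption, not from the original comment):

```python
import psutil

# Look for Python processes that have been reparented to init (ppid == 1);
# the daemon started by the actor in the repro above ends up here after
# the Ray worker that spawned it exits.
for proc in psutil.process_iter(["pid", "ppid", "name", "cmdline"]):
    try:
        if proc.info["ppid"] == 1 and "python" in (proc.info["name"] or ""):
            print(proc.info["pid"], proc.info["cmdline"])
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue
```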
cc @rkooo567 this is the issue we investigated before. |
It is likely the same as what you list here, @matthewdeng. I guess I'm still wondering why the driver upgrade fixed the issue. It could be that 510 drivers free resources more aggressively. After fixing the issue we should loop back and verify it works on 470 drivers (I think they still have a good bit of lifetime in them, but I could be wrong).
@matthewdeng @cadedaniel @ericl So the driver update does not fix the issue for me. Here again is an example with …
after …
The main process seems to be gone, but the 10 workers remain.
I have the same issue in a different context. Workers end properly, all object references are cleared, but they stay in …
It's interesting that PyTorch's dataloader subprocess does a health check with its parent (pytorch/pytorch#6606), so it is supposed to terminate itself if the parent dies.
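As an illustration of that pattern (a simplified sketch, not PyTorch's actual implementation), a child process can poll its parent's pid and shut itself down once the parent is gone:

```python
import os
import time

def worker_loop(parent_pid: int) -> None:
    """Run inside a child process (launched e.g. via multiprocessing with
    the parent's pid passed in); exit when the parent disappears."""
    while True:
        time.sleep(1)  # placeholder for one unit of real work
        # On Linux the child is reparented (typically to pid 1) once the
        # parent dies, so this check fails and the worker stops itself.
        if os.getppid() != parent_pid:
            print("parent gone, shutting down")
            return
```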
I've spent some time with Ray + Torch dataloader and can't reproduce the reported behavior. I think it's possible for Ray Lightning + Lightning + Torch Dataloader to have the issue when Ray + Torch dataloader doesn't, as the Ray Lightning integration overrides some cleanup logic in the default Lightning. Things I've tried: …
I will try an end-to-end example on DataLoader + Ray Tune + Ray Lightning + Lightning tomorrow. I have also been trying exclusively in Ray Jobs and should also try in Ray Client.
It would be really helpful to have a runnable reproduction script, which includes the Lightning components. There is a good chance I won't be able to reproduce without it.
What about @matthewdeng's Jan 5th repro above with just multiprocessing?
I stopped a previous Ray Tune run by pressing Ctrl-C multiple times, and the next tune run then hit an OOM; maybe that was the issue. Now I always run `ray stop` after cancelling experiments with SIGINT, and everything is fine.
Hmm, I don't think this is exactly what's happening here, because the Torch dataloaders should die when their ppid changes. When I run the repro, the spawned process's ppid changes to 1.

That said, if the raylet polled each worker process for child processes, we could keep track of other processes to kill, and likely fix this case. Is there a better way to track which processes to clean up? Overall, it seems that the root cause here is the Torch dataloader spawning processes which don't die when their parent process dies (although we're probably hitting an edge case they didn't think of).
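A rough sketch of what that kind of cleanup could look like using psutil (an illustration only, not the raylet's actual implementation):

```python
import psutil

def kill_child_processes(worker_pid: int, grace_period_s: float = 5.0) -> None:
    """Terminate all descendants of a worker process, escalating to SIGKILL."""
    try:
        worker = psutil.Process(worker_pid)
    except psutil.NoSuchProcess:
        return
    children = worker.children(recursive=True)
    for child in children:
        child.terminate()  # SIGTERM first, to allow a graceful shutdown
    _, alive = psutil.wait_procs(children, timeout=grace_period_s)
    for child in alive:
        child.kill()  # force kill anything that ignored SIGTERM
```

Escalating from SIGTERM to SIGKILL keeps a window for graceful shutdown while still guaranteeing that nothing survives to hold GPU memory.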
Let's start by fixing the simple multiprocessing repro case?
This would not fix the issue in the case I describe above: when a remote worker dies for whatever reason, a `ray stop` might not be possible, and when the main process tries to resend a job to the worker node, the node's memory is still full.
… child processes in the CoreWorker shutdown sequence. (#33976) We kill all child processes when a Ray worker process exits. This addresses process leaks that caused GPU OOM errors in #31451. There is some risk to this PR, particularly if Ray users rely on Ray's existing behavior of leaking processes. We don't know of any such user, but we provide a new flag RAY_kill_child_processes_on_worker_exit to provide a workaround in case someone is impacted.
Hi all, I have an update for this issue: we merged a partial fix into master and expect it to make it out in Ray 2.4. On Linux, in the case where the driver script is cancelled or exits normally, each Ray worker process will now kill its immediate child processes. Although we could not reproduce the Torch dataloader process leak described here, we believe this will fix the Torch issue and free the previously reserved GPU memory. We have plans for a more holistic approach to handle cases where the worker processes crash and leak processes, and where child processes cause leaks by spawning child processes of their own. Please reach out if you are experiencing these issues. Follow the linked issues for updates. Thanks!
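For anyone who does rely on the old behavior, the opt-out flag mentioned above is an environment variable that has to be set before the Ray processes start; a minimal sketch, assuming `"false"` is accepted as the disable value:

```python
import os

# Assumed opt-out: keep the pre-2.4 behavior of not killing child
# processes on worker exit. The flag name comes from the fix above;
# the accepted value ("false") is an assumption here.
os.environ["RAY_kill_child_processes_on_worker_exit"] = "false"

import ray

# The variable must be in the environment before the Ray runtime starts
# (e.g. before `ray start` on a cluster node, or before ray.init() locally),
# since it is read by the worker/raylet processes.
ray.init()
```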
What happened + What you expected to happen
This has been discussed in #30414 at the end, but it seems more appropriate to start a new issue because it is probably unrelated.
I run a cluster with a head node and worker nodes. The head node is started via the CLI with `ray start ...options`, and the worker nodes also run `ray start ...options` and connect to the head. All the worker nodes have GPUs.

On the head node I then run a tune script that requests a GPU. After ctrl+c, the GPU does not seem to be freed on the worker nodes, and IDLE or TRAIN processes remain that block the GPU memory. Only a full `ray stop` kills all the processes.

I also tested this in a much simpler setup (just a head node with a GPU): run the tune script, press ctrl+c, and the memory on the GPU remains blocked and ray:TRAIN processes remain (see #30414).
EDIT: The issue was found to be related to `num_workers>0` in the PyTorch DataLoader, which leaves extra Ray processes open after ctrl+c. Related: ray-project/ray_lightning#87, pytorch/pytorch#66482

EDIT 2: I could solve the issue by using 515.xxx NVIDIA drivers (but only for the main node); for 470.xxx and/or on the worker nodes the issue seems to remain.

EDIT 3: The issue persists, regardless of driver version.
Versions / Dependencies
Ray 2.2.0
PyTorch 1.12.1
NVIDIA driver 470.xxx
Reproduction script
Something like: any tune job that runs on a node that was previously started with `ray start --num_cpus=4 --num_gpus=1`.
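A minimal sketch of such a tune job, assuming a `DataLoader` with `num_workers > 0` as described in the edits above; the dataset, the trivial objective, and the resource numbers are placeholders, not the original training code:

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset
from ray import tune
from ray.air import session

def objective(config):
    # The DataLoader with num_workers > 0 forks worker subprocesses inside
    # the trial process; these are the processes observed to survive ctrl+c.
    dataset = TensorDataset(torch.randn(1024, 8), torch.randn(1024, 1))
    loader = DataLoader(dataset, batch_size=32, num_workers=config["num_workers"])
    for epoch in range(1000):
        for _batch in loader:
            time.sleep(0.01)  # placeholder for a real training step
        session.report({"epoch": epoch})

tune.run(
    objective,
    config={"num_workers": 4},
    resources_per_trial={"cpu": 4, "gpu": 1},
)
```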
Issue Severity
Medium: It is a significant difficulty but I can work around it.