[Cherry-pick] "[CoreWorker] Partially address Ray child process leaks by killing all child processes in the CoreWorker shutdown sequence. #33976" #34181

Merged

cadedaniel (Member)

Cherry-picking #33976 onto the release branch. See that PR for a full description. TL;DR: this partially fixes a process leak that affects PyTorch DataLoader users (#31451).

… child processes in the CoreWorker shutdown sequence. (ray-project#33976)

We kill all child processes when a Ray worker process exits. This addresses the process leaks that caused GPU OOM errors in ray-project#31451. This PR carries some risk, particularly for Ray users who rely on Ray's existing behavior of leaking processes. We don't know of any such users, but we provide a new flag, RAY_kill_child_processes_on_worker_exit, as a workaround in case someone is impacted.
cadedaniel added the core and v2.4.0-pick labels on Apr 7, 2023.
clarng (Contributor) commented Apr 8, 2023

Let's get TL approval.

zhe-thoughts (Collaborator) left a comment

Approved for picking into 2.4 since the issue is a release blocker.

scv119 merged commit eb67c06 into ray-project:releases/2.4.0 on Apr 9, 2023.