[Serve][Core] Fix serve_long_running memory leak by fixing GCS pubsub asyncio task leak #29187
Signed-off-by: Archit Kulkarni <[email protected]>
@simon-mo Do you mind editing the PR description if the description of the memory issue is inaccurate?
@scv119 @iycheng Is it feasible to add a unit test for the memory issue?
Great findings! I'll let @iycheng help with the unit test part. One possible way is to inspect _close._waiters and ensure we have O(1) instead of O(n) waiters.
One thing you can do is to check
Co-authored-by: Simon Mo <[email protected]> Signed-off-by: Archit Kulkarni <[email protected]>
At @sihanwang41's suggestion, I tested it out again, but this time without the change in the first two lines where we removed
A future cannot be cancelled; a task can (this is the API in Python 3.6, if I recall). Also, ensure_future in Python 3.6 creates a task anyway, which is a confusing API.
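As a small illustration of the point about `ensure_future`, the sketch below wraps an `asyncio.Event.wait()` coroutine and checks what comes back. It is a standalone example, not code from this PR; `describe_waiter` is a hypothetical helper name, and the behavior shown is what I would expect on recent CPython versions (I have not re-verified 3.6).

```python
import asyncio
from typing import Tuple


async def describe_waiter() -> Tuple[bool, bool]:
    """Return (is the wrapper a Task?, did cancellation take effect?)."""
    event = asyncio.Event()
    # Wrapping a coroutine with ensure_future schedules it as a Task.
    waiter = asyncio.ensure_future(event.wait())
    is_task = isinstance(waiter, asyncio.Task)

    # Yield once so the waiter starts running and blocks on the event.
    await asyncio.sleep(0)

    # Request cancellation, then await the task so the cancellation is processed.
    waiter.cancel()
    try:
        await waiter
    except asyncio.CancelledError:
        pass
    return is_task, waiter.cancelled()


if __name__ == "__main__":
    print(asyncio.run(describe_waiter()))
```

Because the wrapper is a task, calling `cancel()` on it unwinds the pending `event.wait()` call, which is the property the fix relies on.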
Ah I see, I was testing on 3.8 so I didn't run into that
…i/ray into serve-leak-debug-3
Thanks, added this as a unit test
Windows runtime env tests are flaky on master. The Java test failed due to a randomly chosen port conflict, but I couldn't find evidence of flakiness on the tracker. Restarting the Java test to be safe.
Java passed on retry
… asyncio task leak (#29187)
Debugged with @simon-mo and @scv119. The Serve long running test was failing due to a memory leak in dashboard.py. The root cause was in the GCS pubsub code, with the _close: asyncio.Event object adding millions of waits every few minutes without the waits ever being killed, causing the _close._waiters queue to grow without bound. The root cause is when awaiting with FIRST_COMPLETED, the caller is responsible for killing the unfinished task.
This PR:
- Fixes the memory leak by canceling the close task if it wasn't done. (Contributed by @simon-mo)
This PR also adds some side improvements to the release test:
- Use lower-memory instances so that memory leaks aren't hidden by the instances having a lot of available memory
- Gracefully handle the case where wrk fails, which previously caused the release test output to be overwritten in a tight loop, which led to hard-to-interpret errors being surfaced to the release test infrastructure
- Use different ports for the dashboard agents on the multiple cluster_utils virtual nodes to prevent port conflict
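The leak and the fix described in this commit message can be illustrated with a minimal sketch. This is not the actual Ray GCS pubsub code: `poll_once`, `count_waiters`, and the `asyncio.sleep(0)` stand-in for a pubsub poll are hypothetical, and `_waiters` is a private CPython implementation detail of `asyncio.Event` that may change between versions.

```python
import asyncio


async def poll_once(close_event: asyncio.Event, cancel_pending: bool) -> None:
    """Run one polling iteration that races a fake poll against a close event."""
    # Hypothetical stand-in for a real pubsub poll; it finishes almost at once.
    poll = asyncio.ensure_future(asyncio.sleep(0))
    close = asyncio.ensure_future(close_event.wait())

    _done, pending = await asyncio.wait(
        [poll, close], return_when=asyncio.FIRST_COMPLETED
    )

    if cancel_pending:
        # With FIRST_COMPLETED, the caller must clean up the unfinished tasks.
        for task in pending:
            task.cancel()
        # Await the cancelled tasks so their cleanup (finally blocks) has run.
        await asyncio.gather(*pending, return_exceptions=True)


async def count_waiters(cancel_pending: bool, iterations: int = 100) -> int:
    """Return how many waiters remain on the event after `iterations` polls."""
    close_event = asyncio.Event()
    for _ in range(iterations):
        await poll_once(close_event, cancel_pending)

    # Private attribute (assumption: CPython keeps pending waiters in a deque).
    remaining = len(close_event._waiters)

    # Release any leftover waiters so no task is left pending at shutdown.
    close_event.set()
    await asyncio.sleep(0)
    return remaining


if __name__ == "__main__":
    print("without cancel:", asyncio.run(count_waiters(cancel_pending=False)))
    print("with cancel:   ", asyncio.run(count_waiters(cancel_pending=True)))
```

Without cancellation, each iteration should leave one more waiter parked on the event, which matches the unbounded growth of `_close._waiters` reported here. With cancellation, the count should stay constant, which is also the O(1) property suggested earlier in the conversation as a unit-test check.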
… asyncio task leak (#29187) (#29220) Signed-off-by: Ricky Xu <[email protected]>
… asyncio task leak (ray-project#29187) Signed-off-by: Weichen Xu <[email protected]>
Why are these changes needed?
Debugged with @simon-mo and @scv119. The Serve long running test was failing due to a memory leak in `dashboard.py`. The root cause was in the GCS pubsub code: the `_close: asyncio.Event` object was accumulating millions of `wait`s every few minutes without the waits ever being cancelled, causing the `_close._waiters` queue to grow without bound. The underlying issue is that when `await`ing with `FIRST_COMPLETED`, the caller is responsible for cancelling the unfinished task.
This PR:
- Fixes the memory leak by canceling the `close` task if it wasn't done. (Contributed by @simon-mo)
This PR also adds some side improvements to the release test:
- Use lower-memory instances so that memory leaks aren't hidden by the instances having a lot of available memory
- Gracefully handle the case where `wrk` fails, which previously caused the release test output to be overwritten in a tight loop, which led to hard-to-interpret errors being surfaced to the release test infrastructure
- Use different ports for the dashboard agents on the multiple `cluster_utils` virtual nodes to prevent port conflicts
Related issue number
Closes #28977
May address #26568
Checks
- Signed off every commit (`git commit -s`) in this PR.
- Ran `scripts/format.sh` to lint the changes in this PR.