
[Serve][Core] Fix serve_long_running memory leak by fixing GCS pubsub asyncio task leak #29187

Merged
merged 11 commits into ray-project:master from serve-leak-debug-3 on Oct 11, 2022

Conversation


@architkulkarni architkulkarni commented Oct 7, 2022

Why are these changes needed?

Debugged with @simon-mo and @scv119. The Serve long running test was failing due to a memory leak in dashboard.py. The root cause was in the GCS pubsub code: the _close asyncio.Event was accumulating millions of waiters every few minutes because the waits were never cancelled, so the _close._waiters queue grew without bound. The underlying issue is that when awaiting with FIRST_COMPLETED, the caller is responsible for cancelling the unfinished task. (A minimal sketch of this pattern and the fix follows the lists below.)

This PR:

  • Fixes the memory leak by canceling the close task if it wasn't done. (Contributed by @simon-mo)

This PR also adds some side improvements to the release test:

  • Use lower-memory instances so that memory leaks aren't hidden by the instances having a lot of available memory
  • Gracefully handle the case where wrk fails, which previously caused the release test output to be overwritten in a tight loop and led to hard-to-interpret errors being surfaced to the release test infrastructure
  • Use different ports for the dashboard agents on the multiple cluster_utils virtual nodes to prevent port conflicts
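
For illustration, here is a minimal, hypothetical sketch of the leaky pattern and the fix. The names (poll_once, queue, close_event) are placeholders and this is not the actual gcs_pubsub.py code; the point is that asyncio.wait(..., return_when=FIRST_COMPLETED) returns the unfinished tasks in pending, and it is the caller's job to cancel them:

```python
import asyncio


async def poll_once(queue: asyncio.Queue, close_event: asyncio.Event):
    # Hypothetical names; not the actual gcs_pubsub.py code.
    poll_task = asyncio.create_task(queue.get())
    close_task = asyncio.create_task(close_event.wait())
    done, pending = await asyncio.wait(
        {poll_task, close_task}, return_when=asyncio.FIRST_COMPLETED
    )
    # The fix: cancel whatever didn't finish. Without this, close_task (and
    # the waiter it registered on close_event) survives every poll, so
    # close_event._waiters gains one entry per poll and never shrinks.
    for task in pending:
        task.cancel()
    if close_task in done:
        return None  # shutting down
    return poll_task.result()


async def main():
    queue: asyncio.Queue = asyncio.Queue()
    close_event = asyncio.Event()
    queue.put_nowait("message")
    print(await poll_once(queue, close_event))


asyncio.run(main())
```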

Related issue number

Closes #28977
May address #26568

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Archit Kulkarni <[email protected]>
@architkulkarni architkulkarni marked this pull request as ready for review October 7, 2022 21:39
@architkulkarni
Contributor Author

@simon-mo Do you mind editing the PR description if the description of the memory issue is inaccurate?

@architkulkarni
Contributor Author

@scv119 @iycheng Is it feasible to add a unit test for the memory issue?


scv119 commented Oct 7, 2022

Great findings! I'll let @iycheng help with the unit test part. One possible way is to open up _close._waiters and ensure we have O(1) instead of O(n) waiters.
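
A rough sketch of what such a check could look like. This is illustrative only: _waiters is a CPython implementation detail of asyncio.Event, and this is not the test that actually landed in the PR.

```python
import asyncio


async def main():
    close_event = asyncio.Event()

    for _ in range(1000):
        close_task = asyncio.create_task(close_event.wait())
        work_task = asyncio.create_task(asyncio.sleep(0))
        _, pending = await asyncio.wait(
            {close_task, work_task}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            # Without this cancellation, close_event._waiters gains one
            # entry per iteration and the assertion below fails.
            task.cancel()

    await asyncio.sleep(0)  # let the last cancellation be processed
    # With the fix, the waiter queue stays O(1) instead of O(n).
    assert len(close_event._waiters) < 10, len(close_event._waiters)


asyncio.run(main())
```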


simon-mo commented Oct 7, 2022

One thing you can do is to check asyncio.Task.all_tasks() is not growing indefinitely
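
A hedged sketch of that idea (asyncio.Task.all_tasks() was deprecated in favor of asyncio.all_tasks() in Python 3.7+; poll_fn below is a stand-in for whatever repeatedly polls the subscriber, not the actual test code):

```python
import asyncio


async def assert_no_task_leak(poll_fn, iterations: int = 1000) -> None:
    # Baseline before polling; a leaked close-wait task per poll would show
    # up as roughly `iterations` extra entries in all_tasks() afterwards.
    baseline = len(asyncio.all_tasks())
    for _ in range(iterations):
        await poll_fn()
    assert len(asyncio.all_tasks()) <= baseline + 10
```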

Review threads on python/ray/_private/gcs_pubsub.py (outdated; resolved)
architkulkarni and others added 2 commits October 7, 2022 15:00
Co-authored-by: Simon Mo <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
@architkulkarni architkulkarni changed the title [WIP][Serve][Core] Fix serve_long_running memory leak by fixing GCS pubsub asyncio task leak [Serve][Core] Fix serve_long_running memory leak by fixing GCS pubsub asyncio task leak Oct 7, 2022
@architkulkarni
Contributor Author

At @sihanwang41's suggestion, I tested it out again, this time without the change in the first two lines where we removed ensure_future, and the leak was still fixed. @simon-mo, what's the benefit of changing ensure_future to create_task()?


simon-mo commented Oct 7, 2022

A future cannot be cancelled, but a task can (this is the Python 3.6 API, if I recall). Also, ensure_future in Python 3.6 creates a task anyway, which is a confusing API.
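
For context on the two spellings, a generic asyncio illustration (not the gcs_pubsub.py diff): asyncio.create_task exists as a module-level function since Python 3.7 and only accepts coroutines, while asyncio.ensure_future also accepts futures and other awaitables and passes futures through unchanged, which is why it reads more ambiguously.

```python
import asyncio


async def main():
    async def work():
        await asyncio.sleep(10)

    # Both wrap a coroutine in a Task scheduled on the running loop.
    t1 = asyncio.ensure_future(work())  # also accepts Futures/awaitables
    t2 = asyncio.create_task(work())    # coroutines only, Python 3.7+
    t1.cancel()
    t2.cancel()
    await asyncio.gather(t1, t2, return_exceptions=True)
    print(t1.cancelled(), t2.cancelled())  # True True


asyncio.run(main())
```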

@architkulkarni
Contributor Author

Ah, I see. I was testing on Python 3.8, so I didn't run into that.

@architkulkarni
Contributor Author

One thing you can do is to check asyncio.Task.all_tasks() is not growing indefinitely

Thanks, added this as a unit test

@architkulkarni
Contributor Author


Long running test passed

@architkulkarni
Contributor Author

  • windows runtime env tests flaky on master
  • windows serve:test_api was broken on master
  • windows serve:tutorial_rllib was broken on master
  • windows test_args was broken on master
  • windows test_multi_node was broken on master
  • test_dashboard was broken on master
  • test_client flaky on master
  • rllib cartpole tests broken on master

The Java test failed due to a randomly chosen port conflict, but I couldn't find evidence of flakiness on the tracker. Restarting the Java test to be safe.

@architkulkarni architkulkarni added the release-blocker label (P0: Issue that blocks the release) on Oct 11, 2022
@architkulkarni
Contributor Author

Java passed on retry

@architkulkarni architkulkarni added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) on Oct 11, 2022
@architkulkarni architkulkarni merged commit d89a664 into ray-project:master Oct 11, 2022
@architkulkarni architkulkarni deleted the serve-leak-debug-3 branch October 11, 2022 17:29
architkulkarni added a commit that referenced this pull request Oct 11, 2022
… asyncio task leak (#29187)

architkulkarni added a commit that referenced this pull request Oct 11, 2022
…S pubsub asyncio task leak (#29187)"

This reverts commit bced413.
architkulkarni added a commit to architkulkarni/ray that referenced this pull request Oct 11, 2022
… asyncio task leak (ray-project#29187)

rickyyx pushed a commit that referenced this pull request Oct 14, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
… asyncio task leak (ray-project#29187)


Signed-off-by: Weichen Xu <[email protected]>
Labels
Ray 2.1 · release-blocker (P0: Issue that blocks the release) · tests-ok (The tagger certifies test failures are unrelated and assumes personal liability.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[release][CI] long_running_serve failure
5 participants