
[Serve][Core] Fix serve_long_running memory leak by fixing GCS pubsub asyncio task leak #29187

Merged
merged 11 commits into ray-project:master from serve-leak-debug-3 on Oct 11, 2022

Conversation


@architkulkarni architkulkarni commented Oct 7, 2022

Why are these changes needed?

Debugged with @simon-mo and @scv119. The Serve long running test was failing due to a memory leak in dashboard.py. The root cause was in the GCS pubsub code: the _close asyncio.Event was accumulating millions of waiters every few minutes because the waits were never cancelled, so the _close._waiters queue grew without bound. The underlying issue is that when awaiting with FIRST_COMPLETED, the caller is responsible for cancelling the unfinished task. (A minimal sketch of this pattern and the fix follows the lists below.)

This PR:

  • Fixes the memory leak by canceling the close task if it wasn't done. (Contributed by @simon-mo)

This PR also adds some side improvements to the release test:

  • Use lower-memory instances so that memory leaks aren't hidden by the instances having a lot of available memory
  • Gracefully handle the case where wrk fails, which previously caused the release test output to be overwritten in a tight loop and led to hard-to-interpret errors being surfaced to the release test infrastructure
  • Use different ports for the dashboard agents on the multiple cluster_utils virtual nodes to prevent port conflicts
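
For illustration, here is a minimal, hypothetical sketch of the leaky pattern and the fix. The names (poll_once, queue, close_event) are placeholders and this is not the actual gcs_pubsub.py code; the point is that asyncio.wait(..., return_when=FIRST_COMPLETED) returns the unfinished tasks in pending, and it is the caller's job to cancel them:

```python
import asyncio


async def poll_once(queue: asyncio.Queue, close_event: asyncio.Event):
    # Hypothetical names; not the actual gcs_pubsub.py code.
    poll_task = asyncio.create_task(queue.get())
    close_task = asyncio.create_task(close_event.wait())
    done, pending = await asyncio.wait(
        {poll_task, close_task}, return_when=asyncio.FIRST_COMPLETED
    )
    # The fix: cancel whatever didn't finish. Without this, close_task (and
    # the waiter it registered on close_event) survives every poll, so
    # close_event._waiters gains one entry per poll and never shrinks.
    for task in pending:
        task.cancel()
    if close_task in done:
        return None  # shutting down
    return poll_task.result()


async def main():
    queue: asyncio.Queue = asyncio.Queue()
    close_event = asyncio.Event()
    queue.put_nowait("message")
    print(await poll_once(queue, close_event))


asyncio.run(main())
```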

Related issue number

Closes #28977
May address #26568

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Archit Kulkarni <[email protected]>
@architkulkarni architkulkarni marked this pull request as ready for review October 7, 2022 21:39
@architkulkarni
Contributor Author

@simon-mo Do you mind editing the PR description if the description of the memory issue is inaccurate?

@architkulkarni
Contributor Author

@scv119 @iycheng Is it feasible to add a unit test for the memory issue?


scv119 commented Oct 7, 2022

Great findings! I'll let @iycheng help with the unit test part. One possible way is to open up _close._waiters and ensure we have O(1) instead of O(n) waiters.
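
A rough sketch of what such a check could look like. This is illustrative only: _waiters is a CPython implementation detail of asyncio.Event, and this is not the test that actually landed in the PR.

```python
import asyncio


async def main():
    close_event = asyncio.Event()

    for _ in range(1000):
        close_task = asyncio.create_task(close_event.wait())
        work_task = asyncio.create_task(asyncio.sleep(0))
        _, pending = await asyncio.wait(
            {close_task, work_task}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            # Without this cancellation, close_event._waiters gains one
            # entry per iteration and the assertion below fails.
            task.cancel()

    await asyncio.sleep(0)  # let the last cancellation be processed
    # With the fix, the waiter queue stays O(1) instead of O(n).
    assert len(close_event._waiters) < 10, len(close_event._waiters)


asyncio.run(main())
```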


simon-mo commented Oct 7, 2022

One thing you can do is to check asyncio.Task.all_tasks() is not growing indefinitely
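
A hedged sketch of that idea (asyncio.Task.all_tasks() was deprecated in favor of asyncio.all_tasks() in Python 3.7+; poll_fn below is a stand-in for whatever repeatedly polls the subscriber, not the actual test code):

```python
import asyncio


async def assert_no_task_leak(poll_fn, iterations: int = 1000) -> None:
    # Baseline before polling; a leaked close-wait task per poll would show
    # up as roughly `iterations` extra entries in all_tasks() afterwards.
    baseline = len(asyncio.all_tasks())
    for _ in range(iterations):
        await poll_fn()
    assert len(asyncio.all_tasks()) <= baseline + 10
```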

Review threads on python/ray/_private/gcs_pubsub.py (outdated; resolved)
architkulkarni and others added 2 commits October 7, 2022 15:00
Co-authored-by: Simon Mo <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Signed-off-by: Archit Kulkarni <[email protected]>
@architkulkarni architkulkarni changed the title [WIP][Serve][Core] Fix serve_long_running memory leak by fixing GCS pubsub asyncio task leak [Serve][Core] Fix serve_long_running memory leak by fixing GCS pubsub asyncio task leak Oct 7, 2022
@architkulkarni
Contributor Author

At @sihanwang41's suggestion, I tested it out again, this time without the change in the first two lines where we removed ensure_future, and the leak was still fixed. @simon-mo, what's the benefit of changing ensure_future to create_task()?


simon-mo commented Oct 7, 2022

A future cannot be cancelled, but a task can (this is the Python 3.6 API, if I recall). Also, ensure_future in Python 3.6 creates a task anyway, which is a confusing API.
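
For context on the two spellings, a generic asyncio illustration (not the gcs_pubsub.py diff): asyncio.create_task exists as a module-level function since Python 3.7 and only accepts coroutines, while asyncio.ensure_future also accepts futures and other awaitables and passes futures through unchanged, which is why it reads more ambiguously.

```python
import asyncio


async def main():
    async def work():
        await asyncio.sleep(10)

    # Both wrap a coroutine in a Task scheduled on the running loop.
    t1 = asyncio.ensure_future(work())  # also accepts Futures/awaitables
    t2 = asyncio.create_task(work())    # coroutines only, Python 3.7+
    t1.cancel()
    t2.cancel()
    await asyncio.gather(t1, t2, return_exceptions=True)
    print(t1.cancelled(), t2.cancelled())  # True True


asyncio.run(main())
```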

@architkulkarni
Contributor Author

Ah, I see. I was testing on Python 3.8, so I didn't run into that.

@architkulkarni
Contributor Author

One thing you can do is to check asyncio.Task.all_tasks() is not growing indefinitely

Thanks, added this as a unit test

@architkulkarni
Contributor Author


Long running test passed

@architkulkarni
Contributor Author

  • windows runtime env tests flaky on master
  • windows serve:test_api was broken on master
  • windows serve:tutorial_rllib was broken on master
  • windows test_args was broken on master
  • windows test_multi_node was broken on master
  • test_dashboard was broken on master
  • test_client flaky on master
  • rllib cartpole tests broken on master

The Java test failed due to a randomly chosen port conflict, but I couldn't find evidence of flakiness on the tracker. Restarting the Java test to be safe.

@architkulkarni architkulkarni added the release-blocker label (P0: Issue that blocks the release) on Oct 11, 2022
@architkulkarni
Contributor Author

Java passed on retry

@architkulkarni architkulkarni added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) on Oct 11, 2022
@architkulkarni architkulkarni merged commit d89a664 into ray-project:master Oct 11, 2022
@architkulkarni architkulkarni deleted the serve-leak-debug-3 branch October 11, 2022 17:29
architkulkarni added a commit that referenced this pull request Oct 11, 2022
… asyncio task leak (#29187)

architkulkarni added a commit that referenced this pull request Oct 11, 2022
…S pubsub asyncio task leak (#29187)"

This reverts commit bced413.
architkulkarni added a commit to architkulkarni/ray that referenced this pull request Oct 11, 2022
… asyncio task leak (ray-project#29187)

rickyyx pushed a commit that referenced this pull request Oct 14, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
… asyncio task leak (ray-project#29187)


Signed-off-by: Weichen Xu <[email protected]>
Labels
Ray 2.1 · release-blocker (P0: Issue that blocks the release) · tests-ok (The tagger certifies test failures are unrelated and assumes personal liability.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[release][CI] long_running_serve failure
5 participants