Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure pool connection is released in acquire when cancelled #548

Closed
wants to merge 5 commits into from

Conversation

aaliddell
Copy link
Contributor

@aaliddell aaliddell commented Mar 28, 2020

Closes #547 and possibly #467

When wait_for is cancelled, there is a chance that the waited task has already been completed, leaving the connection looking like it is in use. This fix ensures that the connection is returned to the pool in this situation.

A similar bug was fixed in #468, also with wait_for. The python issue was https://bugs.python.org/issue37658, although this is arguably not a solvable bug in Python itself as it can't know how to undo a completed task.

Closes MagicStack#547

When wait_for is cancelled, there is a chance that the waited
task has already been completed, leaving the connection
looking like it is in use. This fix ensures that the connection
is returned to the pool in this situation.

For context, see:
https://bugs.python.org/issue37658
MagicStack#467
Added issue link and comment describing why this workaround
is necessary. Also moved the function to be local to the
call site, as it it not used elsewhere.
asyncpg/pool.py Outdated Show resolved Hide resolved
When wait_for is called with timeout=None, it runs without
a timeout, as desired
@aaliddell
Copy link
Contributor Author

There is something odd going on with the Windows 64 Python 3.8 tests, with a different single test failing each time. As far as I can see, these should not be affected by any of the changes in this PR, but do not show up on master. Is this bad luck with a flaky test or something you've seen before?

@elprans
Copy link
Member

elprans commented May 5, 2020

Looks like a Python bug to me. The proactor event look tries to shutdown a socket which was already terminated on the other end.

@aaliddell
Copy link
Contributor Author

May I prod on getting this looked at again?

This bug crops up often during CI/testing where pools are created and destroyed frequently, leading to the tests hanging whilst waiting for the pool to close.

@elprans
Copy link
Member

elprans commented Jun 19, 2020

I'll take a look at this soon. Thanks.

@simzen85
Copy link

@elprans may I know the status of this PR as our team seems to have the same issue and want to try the fix. Thanks

@elprans
Copy link
Member

elprans commented Aug 11, 2020

I'm not a fan of the piecemeal approach here. This seems like a bug in asyncio's create_connection in the first place, so it would be best if we 1) found and fixed a bug there; and 2) backported the fix into asyncpg as a create_connection wrapper. I'll try to look into the asyncio side of things this week, time permitting.

@aaliddell
Copy link
Contributor Author

The issue lies with wait_for rather than with create_connection. See https://bugs.python.org/issue37658 for the relevant Python bug, which hasn't had much traction. Also see the original local bug, which gives better detail on the sequence of events: #547

The gist of it is that the cancellation can race the pool acquire. In this case, the acquire task has already completed and connection has already been marked as in use, but the caller of wait_for sees the cancellation and hence the connection gets stuck in limbo.

This may be solvable in fixing wait_for in Python, but I'm not sure it can be done in a backwards compatible way and that won't solve the issue for existing Python versions unless you backport the new wait_for impl locally. The current culprit lines I believe are these:

https://github.com/python/cpython/blob/9eb35ab0d71a6bd680e84fa0f828cb634e72b681/Lib/asyncio/tasks.py#L477-L480

An untested solution may be this:

        except exceptions.CancelledError:
            fut.remove_done_callback(cb)
            if fut.done():
                return fut.result()
            fut.cancel()
            raise

But this solution may have unintended consequences, since there are now situations where wait_for will ignore a cancellation and return anyway, which is generally highly discouraged. An alternate solution would be to ditch wait_for completely and embed the timeout logic into the relevant functions, thereby allowing for handling this situation.

@aaliddell
Copy link
Contributor Author

Actually, on second thought: the stem of this issue with wait_for is that it is awaiting the timeout task, rather than the actual desired task. In that current layout, this issue may not be solvable cleanly.

Instead, the tasks in wait_for should be flipped, such that the timeout waiter task is the background task, which can then inject the cancellation into the main task. In that case, cancelling wait_for would behave correctly, since it would either return the task completion or cancel the pending task, with no gap for a cancellation to slip in as far as I can tell. There'd have to be some checks to ensure TimeoutError is raised correctly, but I can't see these being showstoppers.

@elprans
Copy link
Member

elprans commented Aug 15, 2020

is that it is awaiting the timeout task, rather than the actual desired task. In that current layout, this issue may not be solvable cleanly.

Not quite. wait_for isn't awaiting on the timeout task, it simply blocks on a future and then task and the timeout race to resolve that future. There is no "foreground" or "background" tasks there, they're equal. The real issue is that create_connection is not handling the cancellation correctly and doesn't clean up after itself in all cases.

@elprans
Copy link
Member

elprans commented Aug 15, 2020

Upon closer examination it actually does seem like a bug in wait_for: a race between waited task completion and wait_for cancellation. Task completion should take precedence like in a naked future, so:

import asyncio
import asyncpg

async def main():

    fut = asyncio.get_event_loop().create_future()
    fut.set_result(True)

    waiter = asyncio.ensure_future(asyncio.wait_for(fut, timeout=1))
    waiter.cancel()

    result = await waiter
    print(result)

asyncio.run(main())

The above should print True, but it raises CancelledError instead. I'll post a fix upstream and a workaround for asyncpg would be to copy-paste the fixed implementation of wait_for until things get merged and backpatched.

@aaliddell
Copy link
Contributor Author

aaliddell commented Aug 15, 2020

Yep, I've got a fixed wait_for implementation that does what I proposed and in doing so it turns out it's broken in another way too: if the timeout fires and attempts to cancel the task and the task absorbs the cancellation, the task's return value is silently dropped, which will also give this same resource leak issue. I'll push it to my cpython repo fork shortly and let you know, if you want to take it?

If you've done an upstream cpython change before it may be easier for you to submit, rather than me figure out how the python change process works? Otherwise I'm happy enough to try myself.

@elprans
Copy link
Member

elprans commented Aug 15, 2020

the task absorbs the cancellation, the task's return value is silently dropped, which will also give this same resource leak issue

This isn't a bug an wait_for, the task being cancelled must be well-behaved.

@aaliddell
Copy link
Contributor Author

So there's two sides to that bit:

  • wait_for could transparently behave like the task it wraps. If the task is well-behaved: wait_for should be well behaved. If the task is poorly behaved and returns after swallowing a cancellation: wait_for should be poorly behaved and return after swallowing a cancellation (even if that cancellation was the timeout). The downside here is that TimeoutError will not be raised in that specific situation.

    e.g. The two lines should behave identically in terms of external cancellation:

    await task
    await wait_for(task, timeout=100)
    
  • wait_for could guarantee that it will raise TimeoutError if the timer finishes, regardless of whether the task is well behaved or not. The downside here is that a badly behaved task result will be lost and wait_for no longer behaves exactly like the underlying task. This is the current behaviour.

To me, the former behaviour seems more clear to a user, even if badly behaved tasks are an edge case.

Finally, the current behaviour of wait_for is actually already inconsistent:

  • If timeout == None: It behaves transparently like the task
  • If timeout < 0: It cancels the task but does not call _cancel_and_wait. In this situation, when wait_for returns, the underlying task may actually still be running
  • If timeout > 0: It cancels the task and calls _cancel_and_wait

@elprans
Copy link
Member

elprans commented Aug 15, 2020

* If `timeout > 0`: It cancels the task and calls `_cancel_and_wait`

Yea, that's my fault, actually. I missed that case when I was fixing wait_for() last time around (in python/cpython#7216) 😄

elprans added a commit that referenced this pull request Aug 16, 2020
`asyncio.wait_for()` currently has a bug where it raises a
`CancelledError` even when the wrapped awaitable has completed.
The upstream fix is in python/cpython#37658.  This adds a workaround
until the aforementioned PR is merged, backported and released.

Fixes: #467
Fixes: #547
Related: #468
Supersedes: #548
elprans added a commit that referenced this pull request Aug 16, 2020
`asyncio.wait_for()` currently has a bug where it raises a
`CancelledError` even when the wrapped awaitable has completed.
The upstream fix is in python/cpython#21894.  This adds a workaround
until the aforementioned PR is merged, backported and released.

Fixes: #467
Fixes: #547
Related: #468
Supersedes: #548
elprans added a commit that referenced this pull request Aug 16, 2020
`asyncio.wait_for()` currently has a bug where it raises a
`CancelledError` even when the wrapped awaitable has completed.
The upstream fix is in python/cpython#21894.  This adds a workaround
until the aforementioned PR is merged, backported and released.

Co-authored-by: Adam Liddell <[email protected]>
Fixes: #467
Fixes: #547
Related: #468
Supersedes: #548
@elprans
Copy link
Member

elprans commented Aug 16, 2020

@aaliddell I opened python/cpython#21894 for the upstream fix and #608 here adds a clean workaround. Thank you for helping to diagnose this!

@aaliddell
Copy link
Contributor Author

Nice! 👍

elprans added a commit that referenced this pull request Aug 27, 2020
`asyncio.wait_for()` currently has a bug where it raises a
`CancelledError` even when the wrapped awaitable has completed.
The upstream fix is in python/cpython#21894.  This adds a workaround
until the aforementioned PR is merged, backported and released.

Co-authored-by: Adam Liddell <[email protected]>
Fixes: #467
Fixes: #547
Related: #468
Supersedes: #548
@elprans
Copy link
Member

elprans commented Aug 27, 2020

Fixed in #608.

@elprans elprans closed this Aug 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pool never closes if acquire with timeout is cancelled
3 participants