
bpo-42130: Fix swallowing of cancellation by wait_for #26097

Closed
wants to merge 3 commits

Conversation


@ods ods commented May 13, 2021

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

CLA Missing

Our records indicate the following people have not signed the CLA:

@ods

For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day for our records to be updated.

You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!

@aaliddell

aaliddell commented May 13, 2021

Does this actually solve the original issue where a connection/resource can be leaked, or are we just trading one variant of the bug for another? I agree this fixes some odd cancellation behaviour, but if I remember correctly, we're back to having a likelihood of leaked resources.

Following from one of the comments in the second link: I'm still not sure why wait_for is structured the way it is, where the primary await is on the waiter future rather than the user-provided future, meaning we have to shoehorn the cancellations back to the original future rather than getting them for free.

As a concept, why do we not have the waiter be a simple call_later handle and add a done callback to fut to cancel it? Likewise, the waiter func just checks if the future is done and cancels it if not. My impression is that this would avoid much of the complexity within wait_for that leads to these race conditions and odd cancellation behaviours?

# NOT TESTED: PoC pseudocode!
import asyncio
from asyncio import ensure_future, events, exceptions


def _optionally_cancel_fut(fut, event):
    # Cancel the future on timeout and record that the timeout fired.
    if not fut.done():
        event.set()
        fut.cancel()


async def wait_for(fut, timeout):
    loop = events.get_running_loop()
    fut = ensure_future(fut, loop=loop)
    timeout_occurred = asyncio.Event()  # Likely not optimal
    if timeout is not None and timeout > 0:
        timeout_handle = loop.call_later(
            timeout, _optionally_cancel_fut, fut, timeout_occurred)
        # Stop the timer as soon as the future finishes.
        fut.add_done_callback(lambda _: timeout_handle.cancel())
    elif timeout is not None:
        # Timeout must be negative or zero, cancel immediately
        _optionally_cancel_fut(fut, timeout_occurred)

    try:
        return await fut
    except exceptions.CancelledError as exc:
        if timeout_occurred.is_set():
            raise exceptions.TimeoutError() from exc
        raise
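
For illustration only (not part of the proposal), a quick sanity check of the intended behaviour, assuming the wait_for defined above is in scope:

import asyncio

async def main():
    # Times out: the inner sleep task is cancelled and TimeoutError is raised.
    try:
        await wait_for(asyncio.sleep(10), timeout=0.1)
    except asyncio.TimeoutError:
        print("timed out")
    # Completes in time: the result is passed through unchanged.
    print(await wait_for(asyncio.sleep(0, result="ok"), timeout=1))

asyncio.run(main())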

@ods
Author

ods commented May 13, 2021

Does this actually solve the original issue where a connection/resource can be leaked, or are we just trading one variant of the bug for another? I agree this fixes some odd cancellation behaviour, but if I remember correctly, we're back to having a likelihood of leaked resources.

No, it doesn't solve bpo-37658. But the reverted change doesn't solve it either, at least not completely, and we'll certainly have to work on leaked resources again. The main purpose of this PR is the test case.

@@ -1147,22 +1147,6 @@ def gen():
        res = loop.run_until_complete(task)
        self.assertEqual(res, "ok")

    def test_wait_for_cancellation_race_condition(self):


Does this test now fail if it isn't removed?

Author

Exactly. The test is incorrect (the task must raise) and it doesn't reproduce the original problem.

@ods
Author

ods commented May 13, 2021

As a concept, why do we not have the waiter be a simple call_later handle and add a done callback to fut to cancel it?

It's worth a try at least; I'll take it on tomorrow.

@aaliddell

As a concept, why do we not have the waiter be a simple call_later handle and add a done callback to fut to cancel it?

It's worth a try at least; I'll take it on tomorrow.

👍

I have checked that code in my comment above and it does what is expected at a high level (cancels on timeout etc.), but I haven't checked it against the test suite nor against any of the bugs that have led to this. If it works, then it's a whole lot easier to reason about than the current impl, but I've probably forgotten something important... 🙄

ods added a commit to ods/cpython that referenced this pull request May 14, 2021
@ods
Author

ods commented May 14, 2021

As a concept, why do we not have the waiter be a simple call_later handle and add a done callback to fut to cancel it?

Here is my first attempt. With this change the following tests fail:

  • test_asyncio_wait_for_cancelled — with the new implementation a badly behaved coroutine no longer behaves well when wrapped in wait_for (this may break a lot of code);
  • test_wait_for_cancellation_race_condition — i.e. the main goal is not reached;
  • test_wait_for_raises_timeout_error_if_returned_during_cancellation — quite similar: a badly behaved coroutine is not cancelled, and therefore no TimeoutError is raised;
  • test_wait_for_self_cancellation — once again, a badly behaved uncancellable coroutine remains uncancellable when wrapped in wait_for.

Let's put aside the arguably incompatible changes in behaviour and focus on the main goal. In the test case, set_result() is called on the inner task we create with ensure_future(), while the scheduled cancel() is called on the outer task that wraps the wait_for coroutine. Both happen in the same event loop step, so the scheduled callbacks haven't propagated the result to the outer task yet.
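
To make the scenario concrete, here is a minimal standalone sketch of it (not part of this PR; the names are illustrative):

import asyncio

async def main():
    loop = asyncio.get_running_loop()
    inner = loop.create_future()
    # The outer task wraps the wait_for() coroutine, which in turn waits on `inner`.
    outer = asyncio.ensure_future(asyncio.wait_for(inner, 1))
    await asyncio.sleep(0)    # let wait_for() start waiting
    inner.set_result("done")  # the inner future completes...
    outer.cancel()            # ...and the outer task is cancelled in the same step,
                              # before the done callbacks have propagated the result
    try:
        print(await outer)    # without the fix this prints "done", swallowing the cancellation
    except asyncio.CancelledError:
        print("cancellation propagated")  # the behaviour this PR argues for

asyncio.run(main())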

@aaliddell

aaliddell commented May 14, 2021

👍

Regarding the 'badly behaved' coroutines: I'd argue that it is behaving as well as can be expected, since wait_for cannot fix a badly behaved coroutine that swallows a cancellation. This does create an explicit precondition that any coroutine passed to wait_for must be well behaved for it to work as expected, which should be documented if it is not already. From another of my comments in a different thread:

wait_for could transparently behave like the task it wraps. If the task is well-behaved: wait_for should be well behaved. If the task is poorly behaved and returns after swallowing a cancellation: wait_for should be poorly behaved and return after swallowing a cancellation (even if that cancellation was the timeout). The downside here is that TimeoutError will not be raised in that specific situation.

If we try to patch a poorly behaved coroutine by ensuring we raise the timeout regardless of whether the cancellation succeeds, we're masking the fact that the task underneath has been poorly behaved and expose ourselves to the resource leak, but I can see this is perhaps going to be a point of debate 😄. I guess it comes down to two options (a sketch of what 'badly behaved' means here follows the list):

  • We guarantee that TimeoutError will always be raised at the timeout, even for poorly behaved inputs. However, resources may be leaked due to the cancellation race. This is the current implementation.
  • We only guarantee the TimeoutError for well behaved inputs, which prevents the cancellation race. However, for badly-behaved inputs, we cannot guarantee that the TimeoutError will be raised if it swallows the cancellation. This is the alternate implementation.
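
For concreteness, an illustrative sketch (not code from the test suite) of a 'badly behaved' coroutine in this sense, i.e. one that swallows the cancellation:

import asyncio

async def badly_behaved():
    try:
        await asyncio.sleep(10)
    except asyncio.CancelledError:
        # Swallow the cancellation and return normally instead of re-raising,
        # so wait_for() has no way to make its timeout observable.
        return "finished anyway"

# Under option 2, wait_for(badly_behaved(), 0.1) would return "finished anyway"
# instead of raising TimeoutError; under option 1 it raises TimeoutError but may
# drop the value the coroutine returned.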

Re test_wait_for_cancellation_race_condition: don't we want this one to fail now? As you pointed out, it may be expecting something incorrect. Or is it now passing when we don't want it to?

@ods
Author

ods commented May 14, 2021

Regarding the 'badly behaved' coroutines: I'd argue that it is behaving as best as can be expected, since wait_for cannot fix a badly behaved coroutine that swallows a cancellation.

IMO, both ways are acceptable. But the important thing is that it's an incompatible change that may break someone's code.

Re test_wait_for_cancellation_race_condition: don't we want this one to fail now? As you pointed out, it may be expecting something incorrect. Or is it now passing when we don't want it to?

To me, the behaviour it tests shouldn't be guaranteed. Such things should be done with patterns like RAII or context managers. But that's quite problematic with the current API, and the whole point of trying the new approach was to remove the race in this particular case. And this goal is not reached.

@github-actions

This PR is stale because it has been open for 30 days with no activity.

@RyanSept

Re: the aiokafka library issue tagged above by @ods. I hope this will add some perspective to the issue. When we upgrade to Python 3.8.10, we're no longer able to reliably stop the consumer (close the connection and release resources). We wind up needing to toss the stop call into a task and move on, but this leaves us with unreleased resources. Our only solution has been to downgrade to Python 3.8.5, which isn't very sustainable in the long run.

@Dreamsorcerer
Contributor

I think I've fixed this properly for both cases in #28149.

@graingert
Contributor

I think each time wait_for is cancelled it should pass on the cancellation to the future. A TimeoutError should only be thrown if the timeout expires and wait_for is not subsequently cancelled:

# Assumes the asyncio.tasks module context; standalone equivalents of its imports:
import asyncio
from asyncio import ensure_future, events, exceptions


async def wait_for(fut, timeout):
    """Wait for the single Future or coroutine to complete, with timeout.

    Coroutine will be wrapped in Task.

    Returns result of the Future or coroutine.  When a timeout occurs,
    it cancels the task and raises TimeoutError.  To avoid the task
    cancellation, wrap it in shield().

    If the wait is cancelled, the task is also cancelled.

    This function is a coroutine.
    """
    loop = events.get_running_loop()

    if timeout is None:
        return await fut

    timed_out = False
    event = asyncio.Event()
    fut = ensure_future(fut, loop=loop)
    # Done callbacks receive the future as an argument; Event.set() takes none.
    fut.add_done_callback(lambda _: event.set())

    if timeout <= 0:
        timed_out = True
        fut.cancel()

        while not fut.done():
            try:
                await event.wait()
            except exceptions.CancelledError:
                timed_out = False
                fut.cancel()
    else:

        def on_timeout():
            nonlocal timed_out
            timed_out = True
            fut.cancel()

        timeout_handle = loop.call_later(timeout, on_timeout)

        try:
            while not fut.done():
                try:
                    await event.wait()
                except exceptions.CancelledError:
                    timed_out = False
                    timeout_handle.cancel()
                    fut.cancel()
        finally:
            timeout_handle.cancel()

    if timed_out:
        try:
            return fut.result()
        except exceptions.CancelledError as exc:
            raise exceptions.TimeoutError() from exc
    return fut.result()

@Dreamsorcerer
Contributor

I think each time wait_for is cancelled it should pass on the cancellation to the future. A TimeoutError should only be thrown if the timeout expires and wait_for is not subsequently cancelled:

I think you may have missed the issue. There are various timing issues that result in situations where the future may already be complete when handling the cancellation. By only changing the code in wait_for(), you will always have an issue where either the code can produce resource leaks (by not returning the object created in the completed future) or cancellations get suppressed. See the lengthy discussion in my PR linked in the preceding comment.

So far, the only proposal that might work will require making some changes to how the event loop runs tasks.
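
For readers following the thread, a minimal sketch of the resource-leak side of that trade-off (Connection and open_connection are made-up names for illustration; whether the leak actually occurs depends on the race between completion and the timeout):

import asyncio

class Connection:
    async def close(self):
        print("connection closed")

async def open_connection():
    await asyncio.sleep(0.1)
    return Connection()  # the resource is created right as the timeout fires

async def main():
    try:
        conn = await asyncio.wait_for(open_connection(), timeout=0.1)
    except asyncio.TimeoutError:
        # If the inner coroutine had already completed when the timeout
        # cancelled it, the Connection it returned is dropped here:
        # nothing holds a reference to it, so it is never closed.
        return
    await conn.close()

asyncio.run(main())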

Comment on lines +436 to +441
fut.remove_done_callback(cb)
# We must ensure that the task is not running
# after wait_for() returns.
# See https://bugs.python.org/issue32751
await _cancel_and_wait(fut, loop=loop)
raise
@robsdedude robsdedude Jul 20, 2022

I think that the state wait_for is in right now (outside this PR) is as bad as swallowing a KeyboardInterrupt in a utility function. It makes it unusable in my opinion. There is also no way for user code to counteract it, whereas the resource leakage could actually be worked around.

This leads me to possibility 1 for how I think this issue could be solved:

Merge this PR, embrace the quirk and document it well. Tell the user something along the lines of:

Cancelling wait_for will leave the waited-for future/coroutine in any state between (and including) not started and finished. Make sure to perform any necessary clean-up on cancellation.

try:
    await asyncio.wait_for(foo, 123)
except asyncio.CancelledError:
    ...  # do clean-up
    raise

Or alternatively:

try:
    fut = asyncio.ensure_future(foo)
    await asyncio.wait_for(fut, 123)
except asyncio.CancelledError:
    if fut.done() and not fut.cancelled() and fut.exception() is None:
        ...  # the future completed, yet we got cancelled
    raise

After all, if the wrapped future yields multiple times to the event loop, you don't get any guarantees as to how far your inner future got. So the user has to be prepared to clean up the cancellation mess one way or the other.

The second possibility I see is not to dismiss the cancellation, but to defer it:

Suggested change
- fut.remove_done_callback(cb)
- # We must ensure that the task is not running
- # after wait_for() returns.
- # See https://bugs.python.org/issue32751
- await _cancel_and_wait(fut, loop=loop)
- raise
+ if fut.done():
+     # We got cancelled, but we are already done. Therefore, we defer the
+     # cancellation until next time the task yields to the event loop.
+     current_task().cancel()
+     return fut.result()
+ else:
+     fut.remove_done_callback(cb)
+     # We must ensure that the task is not running
+     # after wait_for() returns.
+     # See https://bugs.python.org/issue32751
+     await _cancel_and_wait(fut, loop=loop)
+     raise

@robsdedude robsdedude Jul 20, 2022

To add to that: this is the issue that I think needs a quick resolution (using pytest here, but it should get the point across):

import asyncio

import pytest


@pytest.mark.asyncio
async def test_wait_for_cancellation_propagates():
    inner = asyncio.get_event_loop().create_future()
    outer = asyncio.wait_for(inner, 0.1)
    outer_future = asyncio.ensure_future(outer)
    await asyncio.sleep(0)
    inner.set_result(None)  # inner is done
    outer_future.cancel()  # AND outer got cancelled

    # this fails starting with Python 3.8
    with pytest.raises(asyncio.CancelledError):
        await outer_future

@graingert
Contributor

I think #31847 will fix this issue

@robsdedude

I think #31847 will fix this issue

That might be true for 3.11, but 3.8-3.10 should still be fixed.

@graingert
Contributor

graingert commented Jul 20, 2022

I think #31847 will fix this issue

That might be true for 3.11, but 3.8-3.10 should still be fixed.

And, from https://www.python.org/downloads/release/python-3913/:

According to the release calendar specified in PEP 596, Python 3.9.13 is the final regular maintenance release. Starting now, the 3.9 branch will only accept security fixes and releases of those will be made in source-only form until October 2025.

So this could only be backported as far as 3.10.

@Dreamsorcerer
Contributor

I think #31847 will fix this issue

I'll try and get back to finishing the tests in my PR, which should comprehensively cover all these edge cases and run them against @asvetlov's branch to see if it improves the situation or not.

@graingert
Contributor

graingert commented Jul 20, 2022

I think #31847 will fix this issue

I'll try and get back to finishing the tests in my PR, which should comprehensively cover all these edge cases and run them against @asvetlov's branch to see if it improves the situation or not.

@Dreamsorcerer I also had a go here: bf594bd, but found a bug in asyncio.timeout.

@kumaraditya303
Contributor

Superseded by #96764
