
bpo-42130: Fix swallowing of cancellation by wait_for #26097

Closed
wants to merge 3 commits

Conversation


@ods ods commented May 13, 2021

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

CLA Missing

Our records indicate the following people have not signed the CLA:

@ods

For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day for our records to be updated.

You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!

@aaliddell

aaliddell commented May 13, 2021

Does this actually solve the original issue where a connection/resource can be leaked, or are we just trading one variant of the bug for another? I agree this fixes some odd cancellation behaviour, but if I remember correctly, we're back to having a likelihood of leaked resources.

Following from one of the comments in the second link: I'm still not sure why wait_for is structured the way it is, where the primary await is on the waiter future rather than the user-provided future, meaning we have to shoehorn the cancellations back to the original future rather than getting them for free.

As a concept, why do we not have the waiter be a simple call_later handle and add a done callback to fut to cancel it? Likewise, the waiter func just checks if the future is done and cancels it if not. My impression is that this would avoid much of the complexity within wait_for that leads to these race conditions and odd cancellation behaviours?

# NOT TESTED: PoC pseudocode!
import asyncio
from asyncio import ensure_future, events, exceptions


def _optionally_cancel_fut(fut, event):
    # Cancel the future on timeout and record that the timeout fired.
    if not fut.done():
        event.set()
        fut.cancel()


async def wait_for(fut, timeout):
    loop = events.get_running_loop()
    fut = ensure_future(fut, loop=loop)
    timeout_occurred = asyncio.Event()  # Likely not optimal
    if timeout is not None and timeout > 0:
        timeout_handle = loop.call_later(
            timeout, _optionally_cancel_fut, fut, timeout_occurred)
        # Stop the timer as soon as the future finishes.
        fut.add_done_callback(lambda _: timeout_handle.cancel())
    elif timeout is not None:
        # Timeout must be negative or zero, cancel immediately
        _optionally_cancel_fut(fut, timeout_occurred)

    try:
        return await fut
    except exceptions.CancelledError as exc:
        if timeout_occurred.is_set():
            raise exceptions.TimeoutError() from exc
        raise
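
For illustration only (not part of the proposal), a quick sanity check of the intended behaviour, assuming the wait_for defined above is in scope:

import asyncio

async def main():
    # Times out: the inner sleep task is cancelled and TimeoutError is raised.
    try:
        await wait_for(asyncio.sleep(10), timeout=0.1)
    except asyncio.TimeoutError:
        print("timed out")
    # Completes in time: the result is passed through unchanged.
    print(await wait_for(asyncio.sleep(0, result="ok"), timeout=1))

asyncio.run(main())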

@ods
Author

ods commented May 13, 2021

Does this actually solve the original issue where a connection/resource can be leaked, or are we just trading one variant of the bug for another? I agree this fixes some odd cancellation behaviour, but if I remember correctly, we're back to having a likelihood of leaked resources.

No, it doesn't solve bpo-37658. But the reverted change doesn't solve it either, at least not completely, and we'll certainly have to work on leaked resources again. The main purpose of this PR is the test case.

@@ -1147,22 +1147,6 @@ def gen():
        res = loop.run_until_complete(task)
        self.assertEqual(res, "ok")

    def test_wait_for_cancellation_race_condition(self):


Does this test now fail if it isn't removed?

Author

Exactly. The test is incorrect (the task must raise) and it doesn't reproduce the original problem.

@ods
Author

ods commented May 13, 2021

As a concept, why do we not have the waiter be a simple call_later handle and add a done callback to fut to cancel it?

It's worth a try at least; I'll take it on tomorrow.

@aaliddell

As a concept, why do we not have the waiter be a simple call_later handle and add a done callback to fut to cancel it?

It's worth a try at least; I'll take it on tomorrow.

👍

I have checked that code in my comment above and it does what is expected at a high level (cancels on timeout etc.), but I haven't checked it against the test suite nor against any of the bugs that have led to this. If it works, then it's a whole lot easier to reason about than the current impl, but I've probably forgotten something important... 🙄

ods added a commit to ods/cpython that referenced this pull request May 14, 2021
@ods
Author

ods commented May 14, 2021

As a concept, why do we not have the waiter be a simple call_later handle and add a done callback to fut to cancel it?

Here is my first attempt. With this change the following tests fail:

  • test_asyncio_wait_for_cancelled — with the new implementation a badly behaved coroutine no longer behaves well when wrapped in wait_for (this may break a lot of code);
  • test_wait_for_cancellation_race_condition — i.e. the main goal is not reached;
  • test_wait_for_raises_timeout_error_if_returned_during_cancellation — quite similar: a badly behaved coroutine is not cancelled, and therefore no TimeoutError is raised;
  • test_wait_for_self_cancellation — once again, a badly behaved uncancellable coroutine remains uncancellable when wrapped in wait_for.

Let's put aside the arguably incompatible changes in behaviour and focus on the main goal. In the test case, set_result() is called on the inner task we create with ensure_future(), while the scheduled cancel() is called on the outer task that wraps the wait_for coroutine. Both happen in the same event loop step, so the scheduled callbacks haven't propagated the result to the outer task yet.
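
To make the scenario concrete, here is a minimal standalone sketch of it (not part of this PR; the names are illustrative):

import asyncio

async def main():
    loop = asyncio.get_running_loop()
    inner = loop.create_future()
    # The outer task wraps the wait_for() coroutine, which in turn waits on `inner`.
    outer = asyncio.ensure_future(asyncio.wait_for(inner, 1))
    await asyncio.sleep(0)    # let wait_for() start waiting
    inner.set_result("done")  # the inner future completes...
    outer.cancel()            # ...and the outer task is cancelled in the same step,
                              # before the done callbacks have propagated the result
    try:
        print(await outer)    # without the fix this prints "done", swallowing the cancellation
    except asyncio.CancelledError:
        print("cancellation propagated")  # the behaviour this PR argues for

asyncio.run(main())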

@aaliddell

aaliddell commented May 14, 2021

👍

Regarding the 'badly behaved' coroutines: I'd argue that it is behaving as well as can be expected, since wait_for cannot fix a badly behaved coroutine that swallows a cancellation. This does create an explicit precondition that any coroutine passed to wait_for must be well behaved for it to work as expected, which should be documented if it is not already. From another of my comments in a different thread:

wait_for could transparently behave like the task it wraps. If the task is well-behaved: wait_for should be well behaved. If the task is poorly behaved and returns after swallowing a cancellation: wait_for should be poorly behaved and return after swallowing a cancellation (even if that cancellation was the timeout). The downside here is that TimeoutError will not be raised in that specific situation.

If we try to patch a poorly behaved coroutine by ensuring we raise the timeout regardless of whether the cancellation succeeds, we're masking the fact that the task underneath has been poorly behaved and expose ourselves to the resource leak, but I can see this is perhaps going to be a point of debate 😄. I guess it comes down to two options (a sketch of what 'badly behaved' means here follows the list):

  • We guarantee that TimeoutError will always be raised at the timeout, even for poorly behaved inputs. However, resources may be leaked due to the cancellation race. This is the current implementation.
  • We only guarantee the TimeoutError for well behaved inputs, which prevents the cancellation race. However, for badly-behaved inputs, we cannot guarantee that the TimeoutError will be raised if it swallows the cancellation. This is the alternate implementation.
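
For concreteness, an illustrative sketch (not code from the test suite) of a 'badly behaved' coroutine in this sense, i.e. one that swallows the cancellation:

import asyncio

async def badly_behaved():
    try:
        await asyncio.sleep(10)
    except asyncio.CancelledError:
        # Swallow the cancellation and return normally instead of re-raising,
        # so wait_for() has no way to make its timeout observable.
        return "finished anyway"

# Under option 2, wait_for(badly_behaved(), 0.1) would return "finished anyway"
# instead of raising TimeoutError; under option 1 it raises TimeoutError but may
# drop the value the coroutine returned.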

Re test_wait_for_cancellation_race_condition: don't we want this one to fail now? As you pointed out, it may be expecting something incorrect. Or is it now passing when we don't want it to?

@ods
Author

ods commented May 14, 2021

Regarding the 'badly behaved' coroutines: I'd argue that it is behaving as best as can be expected, since wait_for cannot fix a badly behaved coroutine that swallows a cancellation.

IMO, both ways are acceptable. But the important thing is that it's an incompatible change that may break someone's code.

Re test_wait_for_cancellation_race_condition: don't we want this one to fail now? As you pointed out, it may be expecting something incorrect. Or is it now passing when we don't want it to?

To me, the behaviour it tests shouldn't be guaranteed. Such things should be done with patterns like RAII or context managers. But that's quite problematic with the current API, and the whole point of trying the new approach was to remove the race in this particular case. And this goal is not reached.

@github-actions

This PR is stale because it has been open for 30 days with no activity.

@RyanSept

Re: the aiokafka library issue tagged above by @ods. I hope this will add some perspective to the issue. When we upgrade to Python 3.8.10, we're no longer able to reliably stop the consumer (close the connection and release resources). We wind up needing to toss the stop call into a task and move on, but this leaves us with unreleased resources. Our only solution has been to downgrade to Python 3.8.5, which isn't very sustainable in the long run.

@Dreamsorcerer
Contributor

I think I've fixed this properly for both cases in #28149.

@graingert
Contributor

I think each time wait_for is cancelled it should pass on the cancellation to the future. A TimeoutError should only be thrown if the timeout expires and wait_for is not subsequently cancelled:

# Assumes the asyncio.tasks module context; standalone equivalents of its imports:
import asyncio
from asyncio import ensure_future, events, exceptions


async def wait_for(fut, timeout):
    """Wait for the single Future or coroutine to complete, with timeout.

    Coroutine will be wrapped in Task.

    Returns result of the Future or coroutine.  When a timeout occurs,
    it cancels the task and raises TimeoutError.  To avoid the task
    cancellation, wrap it in shield().

    If the wait is cancelled, the task is also cancelled.

    This function is a coroutine.
    """
    loop = events.get_running_loop()

    if timeout is None:
        return await fut

    timed_out = False
    event = asyncio.Event()
    fut = ensure_future(fut, loop=loop)
    # Done callbacks receive the future as an argument; Event.set() takes none.
    fut.add_done_callback(lambda _: event.set())

    if timeout <= 0:
        timed_out = True
        fut.cancel()

        while not fut.done():
            try:
                await event.wait()
            except exceptions.CancelledError:
                timed_out = False
                fut.cancel()
    else:

        def on_timeout():
            nonlocal timed_out
            timed_out = True
            fut.cancel()

        timeout_handle = loop.call_later(timeout, on_timeout)

        try:
            while not fut.done():
                try:
                    await event.wait()
                except exceptions.CancelledError:
                    timed_out = False
                    timeout_handle.cancel()
                    fut.cancel()
        finally:
            timeout_handle.cancel()

    if timed_out:
        try:
            return fut.result()
        except exceptions.CancelledError as exc:
            raise exceptions.TimeoutError() from exc
    return fut.result()

@Dreamsorcerer
Contributor

I think each time wait_for is cancelled it should pass on the cancellation to the future. A TimeoutError should only be thrown if the timeout expires and wait_for is not subsequently cancelled:

I think you may have missed the issue. There are various timing issues that result in situations where the future may already be complete when handling the cancellation. By only changing the code in wait_for(), you will always have an issue where either the code can produce resource leaks (by not returning the object created in the completed future) or cancellations get suppressed. See the lengthy discussion in my PR linked in the preceding comment.

So far, the only proposal that might work will require making some changes to how the event loop runs tasks.
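
For readers following the thread, a minimal sketch of the resource-leak side of that trade-off (Connection and open_connection are made-up names for illustration; whether the leak actually occurs depends on the race between completion and the timeout):

import asyncio

class Connection:
    async def close(self):
        print("connection closed")

async def open_connection():
    await asyncio.sleep(0.1)
    return Connection()  # the resource is created right as the timeout fires

async def main():
    try:
        conn = await asyncio.wait_for(open_connection(), timeout=0.1)
    except asyncio.TimeoutError:
        # If the inner coroutine had already completed when the timeout
        # cancelled it, the Connection it returned is dropped here:
        # nothing holds a reference to it, so it is never closed.
        return
    await conn.close()

asyncio.run(main())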

Comment on lines +436 to +441
fut.remove_done_callback(cb)
# We must ensure that the task is not running
# after wait_for() returns.
# See https://bugs.python.org/issue32751
await _cancel_and_wait(fut, loop=loop)
raise
@robsdedude robsdedude Jul 20, 2022

I think that the state wait_for is in right now (outside this PR) is as bad as swallowing a KeyboardInterrupt in a utility function. It makes it unusable in my opinion. There is also no way for user code to counteract it, whereas the resource leakage could actually be worked around.

This leads me to possibility 1 for how I think this issue could be solved:

Merge this PR, embrace the quirk and document it well. Tell the user something along the lines of:

Cancelling wait_for will leave the waited-for future/coroutine in any state between (and including) not started and finished. Make sure to perform any necessary clean-up on cancellation.

try:
    await asyncio.wait_for(foo, 123)
except asyncio.CancelledError:
    ...  # do clean-up
    raise

Or alternatively:

try:
    fut = asyncio.ensure_future(foo)
    await asyncio.wait_for(fut, 123)
except asyncio.CancelledError:
    if fut.done() and not fut.cancelled() and fut.exception() is None:
        ...  # the future completed, yet we got cancelled
    raise

After all, if the wrapped future yields multiple times to the event loop, you don't get any guarantees as to how far your inner future got. So the user has to be prepared to clean up the cancellation mess one way or the other.

The second possibility I see is not to dismiss the cancellation, but to defer it:

Suggested change
- fut.remove_done_callback(cb)
- # We must ensure that the task is not running
- # after wait_for() returns.
- # See https://bugs.python.org/issue32751
- await _cancel_and_wait(fut, loop=loop)
- raise
+ if fut.done():
+     # We got cancelled, but we are already done. Therefore, we defer the
+     # cancellation until next time the task yields to the event loop.
+     current_task().cancel()
+     return fut.result()
+ else:
+     fut.remove_done_callback(cb)
+     # We must ensure that the task is not running
+     # after wait_for() returns.
+     # See https://bugs.python.org/issue32751
+     await _cancel_and_wait(fut, loop=loop)
+     raise

@robsdedude robsdedude Jul 20, 2022

To add to that: this is the issue that I think needs a quick resolution (using pytest here, but it should get the point across):

import asyncio

import pytest


@pytest.mark.asyncio
async def test_wait_for_cancellation_propagates():
    inner = asyncio.get_event_loop().create_future()
    outer = asyncio.wait_for(inner, 0.1)
    outer_future = asyncio.ensure_future(outer)
    await asyncio.sleep(0)
    inner.set_result(None)  # inner is done
    outer_future.cancel()  # AND outer got cancelled

    # this fails starting with Python 3.8
    with pytest.raises(asyncio.CancelledError):
        await outer_future

@graingert
Contributor

I think #31847 will fix this issue

@robsdedude

I think #31847 will fix this issue

That might be true for 3.11, but 3.8-3.10 should still be fixed.

@graingert
Contributor

graingert commented Jul 20, 2022

I think #31847 will fix this issue

That might be true for 3.11, but 3.8-3.10 should still be fixed.

And, from https://www.python.org/downloads/release/python-3913/:

According to the release calendar specified in PEP 596, Python 3.9.13 is the final regular maintenance release. Starting now, the 3.9 branch will only accept security fixes and releases of those will be made in source-only form until October 2025.

So this could only be backported as far as 3.10.

@Dreamsorcerer
Contributor

I think #31847 will fix this issue

I'll try and get back to finishing the tests in my PR, which should comprehensively cover all these edge cases and run them against @asvetlov's branch to see if it improves the situation or not.

@graingert
Contributor

graingert commented Jul 20, 2022

I think #31847 will fix this issue

I'll try and get back to finishing the tests in my PR, which should comprehensively cover all these edge cases and run them against @asvetlov's branch to see if it improves the situation or not.

@Dreamsorcerer I also had a go here: bf594bd, but found a bug in asyncio.timeout.

@kumaraditya303
Contributor

Superseded by #96764
