Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

asyncio.TaskGroup may silently discard request to run a task #116048

Open
arthur-tacca opened this issue Feb 28, 2024 · 7 comments
Open

asyncio.TaskGroup may silently discard request to run a task #116048

arthur-tacca opened this issue Feb 28, 2024 · 7 comments
Labels
topic-asyncio type-bug An unexpected behavior, bug, or error

Comments

@arthur-tacca
Copy link

arthur-tacca commented Feb 28, 2024

Bug report

Bug description:

As I understand it, asyncio.TaskGroup should never silently lose a task. Any time you use TaskGroup.create_task(), one of these two things happens:

  • The coroutine you pass runs to the end, with the TaskGroup context manager waiting until this happens. Possibly the task is cancelled, maybe due to another task in the group raising an exception, but the coroutine "sees" this (it gets the cancellation exception) so it can always do some handling of it if needed, and the task group still waits for that all to finish.
  • The TaskGroup.create_task() method raises a RuntimeError because it is not possible to start the task. This happens when the task group is not running (because the context manager hasn't been entered or has already exited), or because it is in the process of shutting down (because one of the tasks in it has finished with an exception).

(Disclaimer: My background is as a user of Trio, where these are definitely the only two possibilities. The main difference is that starting a task in an active but cancelled Trio nursery will start a task and cancel it immediately, which allows it to run to its first unshielded await, rather than raising an exception from start_soon(). But that's a small design difference. The point is that you are still guaranteed one of the two possibilities.)

However, a task can be silently lost if the task group it is in gets shut down before a recently-created task has a chance to get scheduled at all, as in this example:

async with asyncio.TaskGroup() as tg:
    tg.create_task(my_fn(3))
    raise RuntimeError

This snippet seems a bit silly because the task group gets shut down by an exception from the same child task that is spawning a new sibling. But the same situation can happen when an uncaught exception gets thrown by one child at roughly the same time that another child has spawned a sibling. (I came across this issue by launching a task from an inner task group while the outer task group was in the process of shutting down.)

Overall, this follows from this behaviour of asyncio tasks:

t = asyncio.create_task(my_fn())
t.cancel()

This will not run my_fn() at all, not even to the first await within it. This is despite the fact that the docs for asyncio.Task.cancel() say:

This arranges for a CancelledError exception to be thrown into the wrapped coroutine on the next cycle of the event loop. The coroutine then has a chance to clean up or even deny the request ...

I looked over old issues to see if this had been reported and found the #98275 which suggested changing the docs to warn about this, but it has since been closed.

Personally, I would say that the behaviour of create_task and Task.cancel() are incorrect, but looking back at that discussion I can see that this is a matter of opinion. However, I think the task group behaviour really does need to be fixed. It's hard to see how this could be reliably done with the current task behaviour, so I think that gives some weight that it really is the undelying issue.

Perhaps, as a compromise, there could be a new parameter asyncio.create_task(..., run_to_first_await=False), which can be set to True by TaskGroup and other users wanting robust cancellation detection?

CPython versions tested on:

3.12

Operating systems tested on:

Windows

@arthur-tacca arthur-tacca added the type-bug An unexpected behavior, bug, or error label Feb 28, 2024
@arthur-tacca
Copy link
Author

Related: #115957

@gvanrossum
Copy link
Member

I don't feel like introducing flags for alternate semantics. When you create a coroutine and immediately throw an exception into it, the coroutine (which hasn't started executing its body yet) will be cancelled without getting a chance to clean up. Those are the semantics of coroutines defined with async def and the asyncio module faithfully follows this principle.

It's fine to add some documentation for this, but I'm not sure that the various create_task() functions and methods are the place for that -- the behavior is inherent in async def so should be clarified there (if it isn't already).

@arthur-tacca
Copy link
Author

I do see your point: if an abstraction has very similar semantics to the layer below, it's often simplest to declare the semantics are exactly the same and be done with it. But I think most of asyncio's users will miss this detail, even with the suggested documentation note. I certainly missed it even though I was specifically looking for race conditions.

I'm mainly concerned about task groups because the whole point of them is to protect you from edge cases when there are exceptions. But you can use them to write simple code that looks clearly correct, and could be correct (and should be IMO), but suffers from this issue. Like this (full demo):

async def use_connection(conn):
    try:
        await conn.use()  # Might raise exception
    finally:
        conn.close()

async def process_data(item_iter):
    async with asyncio.TaskGroup() as tg:
        async for item in item_iter:
            conn = create_connection(item)
            tg.create_task(use_connection(conn))

Even if you do realise that above code is broken (perhaps by finding out the hard way), there's no really simple fix for it. (Of course you could move connection creation to use_connection() but that's just because it's a toy example; in a real program, that might be impossible without significant restructuring.) The best I can manage involves changing the driver function and wrapping the worker function, like this:

async def use_connection(conn):
    try:
        await conn.use()  # Might raise exception
    finally:
        conn.close()

async def use_connection_wrapper(conn, open_connections):
    open_connections.remove(conn)
    await use_connection(conn)

async def process_data(item_iter):
    open_connections = set()
    try:
        async with asyncio.TaskGroup() as tg:
            async for item in item_iter:
                conn = create_connection(item)
                open_connections.add(conn)
                tg.create_task(use_connection_wrapper(conn, open_connections))
    finally:
        for conn in open_connections:
            conn.close()

If you look at this through the eyes of an asyncio user that doesn't know every low-level detail, it's hard to see the second snippet as anything other than a hack to work around a bug in asyncio. It's certainly not the natural way to write it.

@arthur-tacca
Copy link
Author

Although I've not tested it, I believe this issue also applies to asyncio.start_server when used with task groups:

async def handle_connection(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    try:
        await asyncio.sleep(1)  # ... use connection ...
    finally:
        writer.transport.abort()

async def run_server(port):
    async with asyncio.TaskGroup() as tg:
        def handle_in_taskgroup(r, w):
            tg.create_task(handle_connection(r, w))
        server = await asyncio.start_server(handle_in_taskgroup, port=port)
        tg.create_task(server.serve_forever())

This is not a surprise, as it follows the same pattern as the snippet in my previous comment. It could be worked around in the same way, but again I think that is very non-obvious.

It would be convenient, regardless of safety, if you could specify a task group to start_server (like the handler_nursery= parameter to trio.serve_tcp()):

await asyncio.start_server(handle_connection, port=port, task_group=tg)

That could be used as a reason not to fix this issue in general, because this form of start_server() would give an opportunity to fix this in start_server() itself. But I think this still illustrates how pervasive this type of problem is.

@gvanrossum
Copy link
Member

Let's move the feature request for start_server() to a different issue (and wait until this one has settled).

I feel that the best way forward for you, and for others who feel this is a misfeature of coroutine design that asyncio should explicitly work around, is to turn on eager tasks. This is available from 3.12 onward. If eager tasks don't fit your usage pattern, please explain.

@arthur-tacca
Copy link
Author

It feels a little wrong to use an optimisation (at least I assumed that's what that was for) that happens to have a semantic side effort as a work around this, but to be fair I can't think of any specific problem with it.

The original comment issue in the issue for that, #97696, actually mentioned this specific case, but I think it would be very hard for most asyncio users to figure this out for themselves. My preference would be for a warning in the TaskGroup docs along the lines of "warning: task groups are unsafe / broken unless you use the magic incantation asyncio.get_running_loop().set_task_factory(asyncio.eager_task_factory) at the start of your program." But I understand that something more balanced might be more appropriate.

@arthur-tacca
Copy link
Author

Oh and there's an existing discussion on Discuss of the API change for start_server() so I posted about that there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-asyncio type-bug An unexpected behavior, bug, or error
Projects
Status: Todo
Development

No branches or pull requests

3 participants