[WIP] Async stream handlers #4474

Closed
Wants to merge 6 commits

Conversation

ian-r-rose (Collaborator)

Fixes #4469.

I haven't done much profiling yet to see what kind of performance improvements this has (if any); I'll take a look at that next and post some results here.

import weakref

from .core import CommClosedError
from .metrics import time
from .utils import sync, TimeoutError, parse_timedelta
from .protocol.serialize import to_serialize

if typing.TYPE_CHECKING:
Collaborator Author

I found adding type annotations to the stream_handlers object was useful in implementing this (cf #2803), allowing type checkers to catch which handlers were async or sync. However, I needed this pretty unsightly hack to avoid a circular import, just to get the type name in the module scope. I'm not happy about it, and it could be removed. On the other hand, it's kind of nice for refactoring.
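For illustration, here is a minimal sketch of the TYPE_CHECKING pattern being described (not the exact code from this PR; the imported names below are hypothetical):

```python
import typing

if typing.TYPE_CHECKING:
    # Only evaluated by static type checkers (e.g. mypy), never at runtime,
    # so it cannot trigger a circular import.
    from .scheduler import Scheduler  # hypothetical import, for illustration only


def handle(scheduler: "Scheduler") -> None:
    # The annotation is quoted, so it is not evaluated at import time.
    ...
```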

Member

Does from __future__ import annotations help at all here? Admittedly that is Python 3.7+ only. Though we are planning to drop Python 3.6 soon ( #4390 )

Collaborator Author

Yeah, I had to remove annotations in 4f1944b for 3.6 support. It didn't help that much, though; it mostly meant I didn't have to quote the type names below. The circular import problem remained.

Collaborator

@jakirkham it doesn't help - you still need those imports. from __future__ import annotations helps when you have a class accepting itself or returning itself in its methods, or when class A in a module accepts or returns class B which is declared afterwards in the same module.
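An illustrative example of the cases where the `__future__` import does help, i.e. self-referencing or forward-referencing annotations within a single module:

```python
from __future__ import annotations


class A:
    # Without the __future__ import these annotations would need quoting,
    # since neither A nor B is fully defined when this method is parsed.
    def combine(self, other: A) -> B:
        ...


class B:
    ...
```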

T = TypeVar("T")


def asyncify(func: Callable[..., T]) -> Callable[..., Awaitable[T]]:
Collaborator Author

In some places I've used an async wrapper to have a lighter footprint on the API, but more functions could just be made async in the first place.
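A minimal sketch of the wrapper idea described here; the actual body in the PR may differ:

```python
import functools
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


def asyncify(func: Callable[..., T]) -> Callable[..., Awaitable[T]]:
    """Wrap a synchronous callable so it can be awaited like an async handler."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs) -> T:
        return func(*args, **kwargs)

    return wrapper
```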

Member

Yeah, I almost raised a question about this until I looked at how these methods were used. I agree that making them sync by default is probably the right choice. This seemed like a good solution to me.

@mrocklin (Member) commented Feb 1, 2021

I haven't done much profiling yet to see what kind of performance improvements this has (if any); I'll take a look at that next and post some results here.

Ooh, fun. I look forward to seeing which of the many different ways of profiling you choose :)

@mrocklin (Member) commented Feb 3, 2021 via email

@@ -3,20 +3,25 @@
from contextlib import suppress
import logging
import threading
import typing
Collaborator

As a personal aesthetic preference, I find that using from typing import ... is always a nicer option. It does lead to collisions with from collections.abc import ..., which can, however, be worked around once you drop Python 3.6 support, add from __future__ import annotations everywhere, and run a Python 3.9-compatible version of mypy (on Python 3.7+).

Collaborator Author

I generally agree with you here @crusaderky. In this case I kept it this way because the thing I'm using is typing.TYPE_CHECKING, and I didn't really want that floating around the module namespace. No strong preference either way, though.
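For reference, the two styles being compared behave identically; the only difference is what ends up in the module namespace:

```python
# Style used in this diff: keeps TYPE_CHECKING namespaced under typing.
import typing

if typing.TYPE_CHECKING:
    ...

# Alternative style: imports the flag directly into the module namespace.
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    ...
```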

@mrocklin (Member) commented Feb 3, 2021

So, a thought just came to me. I wonder if all of this could be solved by lru cache.

Expanding on this a bit more (that thought came to me last night and I wanted to get it out quickly): I think that we actually prefer being able to use both sync and async handlers, but we're choosing not to do this for performance reasons, mostly because the inspect.is_coroutine_function function is surprisingly slow. An alternative that might make it fast is to cache the result of this function. This might be a one-line fix:

is_coroutine_function = functools.lru_cache(inspect.is_coroutine_function)

This does open up questions about how expensive lru_cache is, but I suspect that it's significantly faster.
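A rough, self-contained sketch of that caching idea (the helper distributed actually uses may differ, e.g. it may also recognize Tornado-style coroutines; note that passing a function directly to functools.lru_cache requires Python 3.8+, so the decorator form is shown here):

```python
import functools
import inspect


@functools.lru_cache(maxsize=None)
def is_coroutine_function(func) -> bool:
    # Cache the (relatively slow) reflection check; handlers are a small,
    # fixed set of callables, so the cache stays tiny.
    return inspect.iscoroutinefunction(func)
```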

@ian-r-rose (Collaborator Author)

So, a thought just came to me. I wonder if all of this could be solved by lru cache.

Expanding on this a bit more (that thought came to me last night and I wanted to get it out quickly). I think that we actually prefer being able to use both sync and async handlers, but we're choosing not to do this for performance reasons, mostly because the inspect.is_coroutine_function function is surprisingly slow. An alternative that might make it fast is to cache the result of this function. This might be a one-line fix

This is a good point, I'll test it out

@ian-r-rose (Collaborator Author)

Okay, here is a look at some profiling results. I've mostly followed the same strategy detailed by @mrocklin in #4443 (comment) using viztracer and shuffling the same large timeseries. The specific durations for the functions in question are small, but they are typically called thousands of times a second, so it does add up.

I've looked at three cases:

  1. The current main branch
  2. The async-only stream handlers implemented here
  3. Using an LRU cache for is_coroutine_function

Specifically, I'm looking at the profile of Server.handle_stream and is_coroutine_function. The latter only applies in cases (1) and (3).

is_coroutine_function:

Here is a histogram of durations for all sampled calls of is_coroutine_function:
[figure: histogram of is_coroutine_function call durations]

As expected, using the LRU cache is significantly faster, bringing the average duration from 15-20 ns down to 2-3 ns. This is a one-line solution and keeps the API unchanged.

handle_stream:

Here is a histogram of durations for all sampled calls of handle_stream:

[figure: histogram of handle_stream call durations]

A few observations about this:

  • The main and LRU cases both have pretty wide distributions, depending on which synchronous op comes in. They average about 500 ns.
  • It's hard to tell from the image alone, but the average duration of the LRU case is in fact shorter than that of the main case, by some tens of ns. It also looks peakier, maybe because of the more consistent duration of the cached is_coroutine_function
  • The fully async case has a completely different distribution. In this case all handlers are pushed onto the async stack, so we aren't blocking on anything. It has a shorter average duration than the other two (~60 ns). Now, that time difference should be paid for later by the ioloop scheduling the previously-synchronous handlers at some other point.

Overall, it seems to me like the LRU version is indeed a quick win if we don't want to change any of the API, and continue to allow for synchronous handlers. Going fully async may be worthwhile, but involves a breaking API change, and maybe some pain for downstream plugin authors.
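To make the tradeoff concrete, here is a simplified, hypothetical sketch of the two dispatch styles being compared (not the actual Server.handle_stream code):

```python
import inspect


async def dispatch_mixed(handler, msg):
    # Mixed sync/async API: inspect each handler per message
    # (this is the check being cached above).
    if inspect.iscoroutinefunction(handler):
        await handler(msg)  # async handler: scheduled on the event loop
    else:
        handler(msg)        # sync handler: runs inline, briefly blocking the loop


async def dispatch_async_only(handler, msg):
    # Async-only API: every handler is a coroutine function, no inspection needed.
    await handler(msg)
```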

@jakirkham (Member)

Thanks for the detailed results Ian 😄

Do we know how many plugins are developed for Distributed?

@jakirkham (Member)

Oh sorry, one more question: was the asyncio case using Tornado or uvloop? Wouldn't expect a difference, but it would be good to know if there happened to be one (as we are using the latter in performance-sensitive cases atm).

@ian-r-rose (Collaborator Author)

Do we know how many plugins are developed for Distributed?

Good question. I'm not aware of any third-party ones myself, but I don't see any reason why someone couldn't do something similar to this one.

Oh sorry one more question, was the asyncio case using Tornado or uvloop? Wouldn't expect a difference, but would be good to know if there happened to be one (as we are using the latter in performance sensitive cases atm)

This was with uvloop; I haven't done extensive profiling with Tornado yet.

@mrocklin (Member) commented Feb 3, 2021

Thank you for the detailed analysis @ian-r-rose

It sounds like we're going with lru?

@ian-r-rose (Collaborator Author)

It sounds like we're going with lru?

I'm happy to go with that, it's a simpler change for the time being (though I generally like more predictable async vs sync APIs where possible).

@mrocklin (Member) commented Feb 3, 2021 via email

ian-r-rose mentioned this pull request Feb 3, 2021
@jakirkham (Member)

Went ahead and merged Ian's LRU cache PR ( #4481 ). It's a pretty light PR. This coroutine check also only happens in a handful of places.

That said, I don't think it rules out this async approach. There may be other reasons (as noted above) for going this route.

@ian-r-rose (Collaborator Author)

I'll close this now that the LRU cache PR ( #4481 ) is in, but will keep the branch around in case we want to revisit it later.
