[Idea/Draft/Proposal] Exception handling for server exceptions #5184

Open
fjetter opened this issue Aug 6, 2021 · 9 comments · May be fixed by #5307

Comments

@fjetter
Member

fjetter commented Aug 6, 2021

Currently, a lot of exceptions are 'lost' in asyncio land because we schedule many of our coroutines as tornado callbacks, usually with a pattern like self.loop.add_callback(do_something_which_might_raise). To not lose these exceptions entirely, there is, for instance, the log_errors helper, which logs the exception. Other than the logging system, we currently have no way to retrieve these exceptions, neither for internal validation (e.g. tests) nor to propagate them programmatically to users.
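For illustration, a rough sketch of the pattern in question (log_errors here is a simplified stand-in for the real helper):

```python
import logging
from contextlib import contextmanager

from tornado.ioloop import IOLoop

logger = logging.getLogger("distributed")


@contextmanager
def log_errors():
    # Simplified stand-in for distributed.utils.log_errors: log and re-raise
    try:
        yield
    except Exception:
        logger.exception("error in background callback")
        raise


async def do_something_which_might_raise():
    with log_errors():
        raise OSError("stream unexpectedly closed")


if __name__ == "__main__":
    loop = IOLoop.current()
    # Fire-and-forget: the caller keeps no handle to the coroutine, so the
    # OSError above gets logged (by log_errors and by tornado's callback
    # machinery) but can never be retrieved by a test or forwarded to a client.
    loop.add_callback(do_something_which_might_raise)
    loop.call_later(0.5, loop.stop)
    loop.start()
```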

I'm wondering if we can set up an exception handler and forward all exceptions to the running server. These could then be passed on to the client, where we would either raise an exception or issue a log message.

Users will probably want to configure the behaviour for different kinds of exceptions, e.g. ignore CommClosed but raise on KeyError, etc.
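A rough sketch of what such a configurable handler could look like (the policy mapping and the server hook are made up for illustration; the CommClosedError import path is assumed):

```python
import asyncio

from distributed.comm import CommClosedError  # import path assumed

# Hypothetical per-exception-type policy: "ignore", "log", or "forward"
POLICY = {CommClosedError: "ignore", KeyError: "forward"}


def make_exception_handler(server):
    def handler(loop, context):
        exc = context.get("exception")
        action = POLICY.get(type(exc), "log")
        if action == "ignore":
            return
        if action == "forward":
            # Hypothetical hook: hand the exception to the running server,
            # which could then forward it to the client and raise it there.
            server.handle_background_exception(exc)
            return
        loop.default_exception_handler(context)

    return handler


# Wiring sketch, e.g. on server start:
# loop.set_exception_handler(make_exception_handler(server))
```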

@mrocklin
Member

mrocklin commented Aug 6, 2021

That sounds like a good idea to me.

An alternative (not necessarily better) would be to not use add_callback, but instead to track all asyncio futures/tasks that we create and manage them explicitly.
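Something along these lines, perhaps (a minimal sketch; the class and method names are hypothetical):

```python
import asyncio


class Server:
    """Sketch only: keep handles to background tasks instead of
    fire-and-forget loop.add_callback, so exceptions can be retrieved."""

    def __init__(self):
        self._background_tasks = set()
        self.background_exceptions = []

    def schedule(self, coro):
        # Replacement for self.loop.add_callback(...): keep the Task around
        task = asyncio.ensure_future(coro)
        self._background_tasks.add(task)
        task.add_done_callback(self._on_task_done)
        return task

    def _on_task_done(self, task):
        self._background_tasks.discard(task)
        if not task.cancelled() and task.exception() is not None:
            # Available for tests, validation, or forwarding to clients
            self.background_exceptions.append(task.exception())

    async def close(self):
        # Wait for anything still running before shutting down
        if self._background_tasks:
            await asyncio.gather(*self._background_tasks, return_exceptions=True)
        if self.background_exceptions:
            raise self.background_exceptions[0]
```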

@fjetter
Member Author

fjetter commented Aug 6, 2021

but instead to track all asyncio futures/tasks that we create and manage them explicitly.

I also thought about this, but I was a bit scared since we're using this in 83 different places and I don't know how smooth that transition would be. Still an option. Note, though, that the plain asyncio interfaces like call_later, call_at, etc. would still need something, and I don't know yet how PeriodicCallbacks play into this. I'll do a bit of research.

@fjetter
Member Author

fjetter commented Aug 9, 2021

I tried out a few things. loop.set_exception_handler is quite powerful machinery and works exactly as one might imagine. The handler is even called for the "never awaited" cases, which I think is nice.

```python
# Example ctx
{'message': 'Task exception was never retrieved', 'exception': RuntimeError('task'), 'future': <Task finished name='Task-3' coro=<raise_exception_async() done, defined at /Users/fjetter/workspace/distributed-main/asyncio_exception_handler.py:9> exception=RuntimeError('task')>}

{'message': "Exception in callback raise_exception('call_later') at /Users/fjetter/workspace/distributed-main/asyncio_exception_handler.py:4", 'exception': RuntimeError('call_later'), 'handle': <TimerHandle when=0.1635665 raise_exception('call_later') at /Users/fjetter/workspace/distributed-main/asyncio_exception_handler.py:4>}
```
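For reference, a standalone sketch that produces contexts like the two above (not the exact script, but the same idea: one exception from a call_later callback, one from a task whose exception is never retrieved):

```python
import asyncio
import gc


def exception_handler(loop, context):
    # Receives contexts shaped like the two examples above
    print(context)


def raise_exception(source):
    raise RuntimeError(source)


async def raise_exception_async():
    raise RuntimeError("task")


async def main():
    loop = asyncio.get_running_loop()
    loop.set_exception_handler(exception_handler)

    # Exception raised in a plain callback scheduled via call_later
    loop.call_later(0.1, raise_exception, "call_later")

    # Task whose exception is never retrieved: the handler fires once the
    # finished task is garbage collected without having been awaited
    task = asyncio.create_task(raise_exception_async())
    await asyncio.sleep(0.2)
    del task
    gc.collect()
    await asyncio.sleep(0.1)


asyncio.run(main())
```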

A big downside to all of this is that as long as we're using tornado to schedule our callbacks/futures (and this includes PeriodicCallbacks), this will not work.
Tornado callbacks are asyncio.Futures with a done callback attached; that done callback ensures the future is always awaited and its exception is always handled (and logged), so the asyncio exception handler is never invoked.

@mrocklin
Member

mrocklin commented Aug 9, 2021

We can swap out loop.add_callback for asyncio.ensure_future pretty easily, I think.

For PeriodicCallbacks we could probably make our own pretty easily. Here is a Stack Overflow answer: https://stackoverflow.com/a/37514633
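For instance, a sketch of an asyncio-native periodic callback roughly along the lines of that answer (class name hypothetical), using asyncio.ensure_future in place of loop.add_callback:

```python
import asyncio


class AsyncPeriodicCallback:
    """Sketch of a tornado-free PeriodicCallback replacement."""

    def __init__(self, callback, callback_time_s):
        self.callback = callback
        self.callback_time_s = callback_time_s
        self._task = None

    def start(self):
        # asyncio.ensure_future instead of loop.add_callback
        self._task = asyncio.ensure_future(self._run())
        self._task.add_done_callback(self._on_done)

    def stop(self):
        if self._task is not None:
            self._task.cancel()

    async def _run(self):
        while True:
            await asyncio.sleep(self.callback_time_s)
            result = self.callback()
            if asyncio.iscoroutine(result):
                await result

    @staticmethod
    def _on_done(task):
        # Forward failures to the loop's exception handler right away
        # instead of waiting for the task to be garbage collected
        if not task.cancelled() and task.exception() is not None:
            task.get_loop().call_exception_handler(
                {
                    "message": "periodic callback failed",
                    "exception": task.exception(),
                    "task": task,
                }
            )
```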

@fjetter
Member Author

fjetter commented Aug 13, 2021

cc @gjoseph92 you expressed interest in "tornado and async-stuff"

@mrocklin
Member

Regarding the question in #5217 about whether or not to use PubSub.

Yes, I agree that reinventing pubsub would not be great. Some reasons why I might still avoid using PubSub here:

My main motivation is mostly for these things to live on the scheduler in deques. I care more about the dashboard experience than about piping things down to the client. The current PubSub system intentionally bypasses the scheduler in order to keep things peer-to-peer; the worker<->client relationship is the one place where this is broken. I view that relationship as less important than the worker<->scheduler relationship.

Additionally, and this is a smaller point: we'd want this to be especially fast and low-overhead for lots of small msgpack-serializable objects, whereas PubSub was designed more for larger dask-serializable objects.

If we can extend pubsub to deal with all of this well, then I'm in favor. I view those things as being:

  1. Add an option for the scheduler to subscribe
  2. Add buffers on the scheduler side
  3. Verify that we're not significantly slower in the case where there is just one publisher and no subscribers (other than the scheduler).
  4. Verify that if we send msgpack serializable data to the scheduler that it's easy to read there

@mrocklin
Member

Maybe the original idea of "make it easy for clients to subscribe to a feed of events from the scheduler on log_event" was overly ambitious.

We already have a mechanism for the scheduler to send updates to the client. Maybe there should be a Client._stream_handler that is "warn" or "print" that just issues a warning or prints something. That might be a good general tool to have, and it's easy to build on what we already use day-to-day.

Then in Scheduler.log_event we track a few names and issue warnings or print statements down to the client in those cases.
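For illustration, a sketch of the client half of that idea (the op names, message shape, and registration details are assumptions, not the existing protocol):

```python
import warnings


def handle_warn(message=None, **kwargs):
    # Hypothetical stream handler: the scheduler would send
    # {"op": "warn", "message": "..."} down the existing client stream
    warnings.warn(str(message))


def handle_print(message=None, **kwargs):
    # Hypothetical counterpart that simply echoes scheduler messages
    print(message)


# Wiring sketch: register next to the client's existing stream handlers,
# e.g. client._stream_handlers.update({"warn": handle_warn,
#                                      "print": handle_print}),
# and have Scheduler.log_event forward selected topics as such messages.
```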

@fjetter
Member Author

fjetter commented Aug 16, 2021

Client._stream_handler that is "warn" or "print" that just issues a warning or prints something. That might be a good general tool to have, and it's easy to build on what we already use day-to-day.

I think what I have over in #5217 is not much more complicated but comes from a different point of view: instead of logging, I went for a handler pattern. I wanted to avoid any kind of overly verbose logging.


FWIW, getting the pubsub thing to work is partially done in #5218; there are still a few open questions, though:

Add an option for the scheduler to subscribe

I special-cased this with a boolean toggle so that events go directly to the event deques.

Add buffers on the scheduler side

What kind of buffers do you mean?

Verify that we're not significantly slower in the case where there is just one publisher and no subscribers (other than the scheduler).

I'd be surprised to see a performance impact since, regardless of the implementation, there will be one msg from worker to scheduler (assuming no more subscribers).
It will make a dent in scheduler performance once the client subscribes, since we'll need to forward msgs there, but this is one of the reasons why I don't want to blindly forward everything.

Verify that if we send msgpack serializable data to the scheduler that it's easy to read there

That's one of my current roadblocks, mostly since I do not know what the intention is here and/or I encountered a bug (to_serialize({"my": "message"}) breaks the remote end, but that's either a minor bug or some missing plumbing on my side).

@fjetter
Member Author

fjetter commented Sep 8, 2021

Another possibility, quicker and less sophisticated, would be to register log handlers and keep tornado. The log handlers would only trigger on exception-level logs and would call into the log_event machinery of #5217.
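In sketch form (the handler class and the wiring are invented for illustration; log_event as proposed in #5217):

```python
import logging


class LogEventHandler(logging.Handler):
    """Forward ERROR-and-above log records to the server's log_event."""

    def __init__(self, server):
        super().__init__(level=logging.ERROR)
        self.server = server

    def emit(self, record):
        # Hypothetical: server.log_event(topic, msg) as proposed in #5217
        self.server.log_event(
            "error-logs",
            {
                "logger": record.name,
                "message": record.getMessage(),
                "exception": repr(record.exc_info[1]) if record.exc_info else None,
            },
        )


# Wiring sketch: attach to the distributed loggers on server start, e.g.
# logging.getLogger("distributed").addHandler(LogEventHandler(server))
```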
