[Idea/Draft/Proposal] Exception handling for server exceptions #5184

Open
fjetter opened this issue Aug 6, 2021 · 9 comments · May be fixed by #5307

Comments

@fjetter
Member

fjetter commented Aug 6, 2021

Currently, a lot of exceptions are 'lost' in asyncio land because we schedule many of our coroutines as tornado callbacks, usually with a pattern like self.loop.add_callback(do_something_which_might_raise). To not lose these exceptions entirely, there is, for instance, the log_errors helper, which logs the exception. Other than the logging system, we currently have no way to retrieve these exceptions, neither for internal validation (e.g. tests) nor to propagate them programmatically to users.
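For illustration, a rough sketch of the pattern in question (log_errors here is a simplified stand-in for the real helper):

```python
import logging
from contextlib import contextmanager

from tornado.ioloop import IOLoop

logger = logging.getLogger("distributed")


@contextmanager
def log_errors():
    # Simplified stand-in for distributed.utils.log_errors: log and re-raise
    try:
        yield
    except Exception:
        logger.exception("error in background callback")
        raise


async def do_something_which_might_raise():
    with log_errors():
        raise OSError("stream unexpectedly closed")


if __name__ == "__main__":
    loop = IOLoop.current()
    # Fire-and-forget: the caller keeps no handle to the coroutine, so the
    # OSError above gets logged (by log_errors and by tornado's callback
    # machinery) but can never be retrieved by a test or forwarded to a client.
    loop.add_callback(do_something_which_might_raise)
    loop.call_later(0.5, loop.stop)
    loop.start()
```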

I'm wondering if we can set up an exception handler and forward all exceptions to the running server. These could then be passed on to the client, where we would either raise an exception or issue a log message.

Users will probably want to configure the behaviour for different kinds of exceptions, e.g. ignore CommClosed but raise on KeyError, etc.
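A rough sketch of what such a configurable handler could look like (the policy mapping and the server hook are made up for illustration; the CommClosedError import path is assumed):

```python
import asyncio

from distributed.comm import CommClosedError  # import path assumed

# Hypothetical per-exception-type policy: "ignore", "log", or "forward"
POLICY = {CommClosedError: "ignore", KeyError: "forward"}


def make_exception_handler(server):
    def handler(loop, context):
        exc = context.get("exception")
        action = POLICY.get(type(exc), "log")
        if action == "ignore":
            return
        if action == "forward":
            # Hypothetical hook: hand the exception to the running server,
            # which could then forward it to the client and raise it there.
            server.handle_background_exception(exc)
            return
        loop.default_exception_handler(context)

    return handler


# Wiring sketch, e.g. on server start:
# loop.set_exception_handler(make_exception_handler(server))
```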

@mrocklin
Member

mrocklin commented Aug 6, 2021

That sounds like a good idea to me.

An alternative (not necessarily better) would be to not use add_callback, but instead to track all asyncio futures/tasks that we create and manage them explicitly.
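Something along these lines, perhaps (a minimal sketch; the class and method names are hypothetical):

```python
import asyncio


class Server:
    """Sketch only: keep handles to background tasks instead of
    fire-and-forget loop.add_callback, so exceptions can be retrieved."""

    def __init__(self):
        self._background_tasks = set()
        self.background_exceptions = []

    def schedule(self, coro):
        # Replacement for self.loop.add_callback(...): keep the Task around
        task = asyncio.ensure_future(coro)
        self._background_tasks.add(task)
        task.add_done_callback(self._on_task_done)
        return task

    def _on_task_done(self, task):
        self._background_tasks.discard(task)
        if not task.cancelled() and task.exception() is not None:
            # Available for tests, validation, or forwarding to clients
            self.background_exceptions.append(task.exception())

    async def close(self):
        # Wait for anything still running before shutting down
        if self._background_tasks:
            await asyncio.gather(*self._background_tasks, return_exceptions=True)
        if self.background_exceptions:
            raise self.background_exceptions[0]
```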

@fjetter
Member Author

fjetter commented Aug 6, 2021

but instead to track all asyncio futures/tasks that we create and manage them explicitly.

I also thought about this, but I was a bit scared since we're using this in 83 different places and I don't know how smooth that transition would be. Still an option. Note, though, that the plain asyncio interfaces like call_later, call_at, etc. would still need something, and I don't know yet how PeriodicCallbacks play into this. I'll do a bit of research.

@fjetter
Member Author

fjetter commented Aug 9, 2021

I tried out a few things. loop.set_exception_handler is quite powerful machinery and works exactly as one might imagine. The handler is even called for the "never awaited" cases, which I think is nice.

```python
# Example ctx
{'message': 'Task exception was never retrieved', 'exception': RuntimeError('task'), 'future': <Task finished name='Task-3' coro=<raise_exception_async() done, defined at /Users/fjetter/workspace/distributed-main/asyncio_exception_handler.py:9> exception=RuntimeError('task')>}

{'message': "Exception in callback raise_exception('call_later') at /Users/fjetter/workspace/distributed-main/asyncio_exception_handler.py:4", 'exception': RuntimeError('call_later'), 'handle': <TimerHandle when=0.1635665 raise_exception('call_later') at /Users/fjetter/workspace/distributed-main/asyncio_exception_handler.py:4>}
```
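For reference, a standalone sketch that produces contexts like the two above (not the exact script, but the same idea: one exception from a call_later callback, one from a task whose exception is never retrieved):

```python
import asyncio
import gc


def exception_handler(loop, context):
    # Receives contexts shaped like the two examples above
    print(context)


def raise_exception(source):
    raise RuntimeError(source)


async def raise_exception_async():
    raise RuntimeError("task")


async def main():
    loop = asyncio.get_running_loop()
    loop.set_exception_handler(exception_handler)

    # Exception raised in a plain callback scheduled via call_later
    loop.call_later(0.1, raise_exception, "call_later")

    # Task whose exception is never retrieved: the handler fires once the
    # finished task is garbage collected without having been awaited
    task = asyncio.create_task(raise_exception_async())
    await asyncio.sleep(0.2)
    del task
    gc.collect()
    await asyncio.sleep(0.1)


asyncio.run(main())
```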

A big downside to all of this is that as long as we're using tornado to schedule our callbacks/futures (and this includes PeriodicCallbacks), this will not work.
Tornado callbacks are asyncio.Futures with a done callback attached; that done callback ensures the future is always awaited and its exception is always handled (and logged), so the asyncio exception handler is never invoked.

@mrocklin
Member

mrocklin commented Aug 9, 2021

We can swap out loop.add_callback for asyncio.ensure_future pretty easily, I think.

For PeriodicCallbacks we could probably make our own pretty easily. Here is a Stack Overflow answer: https://stackoverflow.com/a/37514633
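For instance, a sketch of an asyncio-native periodic callback roughly along the lines of that answer (class name hypothetical), using asyncio.ensure_future in place of loop.add_callback:

```python
import asyncio


class AsyncPeriodicCallback:
    """Sketch of a tornado-free PeriodicCallback replacement."""

    def __init__(self, callback, callback_time_s):
        self.callback = callback
        self.callback_time_s = callback_time_s
        self._task = None

    def start(self):
        # asyncio.ensure_future instead of loop.add_callback
        self._task = asyncio.ensure_future(self._run())
        self._task.add_done_callback(self._on_done)

    def stop(self):
        if self._task is not None:
            self._task.cancel()

    async def _run(self):
        while True:
            await asyncio.sleep(self.callback_time_s)
            result = self.callback()
            if asyncio.iscoroutine(result):
                await result

    @staticmethod
    def _on_done(task):
        # Forward failures to the loop's exception handler right away
        # instead of waiting for the task to be garbage collected
        if not task.cancelled() and task.exception() is not None:
            task.get_loop().call_exception_handler(
                {
                    "message": "periodic callback failed",
                    "exception": task.exception(),
                    "task": task,
                }
            )
```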

@fjetter
Member Author

fjetter commented Aug 13, 2021

cc @gjoseph92 you expressed interest in "tornado and async-stuff"

@mrocklin
Member

Regarding the question in #5217 about whether or not to use PubSub.

Yes, I agree that reinventing pubsub would not be great. Some reasons why I might still avoid using PubSub here:

My main motivation is mostly for these things to live on the scheduler in deques. I care more about the dashboard experience than about piping things down to the client. The current PubSub system intentionally bypasses the scheduler in order to keep things peer-to-peer; the worker<->client relationship is the one place where this is broken. I view that relationship as less important than the worker<->scheduler relationship.

Additionally, and this is a smaller point: we'd want this to be especially fast and low-overhead for lots of small msgpack-serializable objects, whereas PubSub was designed more for larger dask-serializable objects.

If we can extend pubsub to deal with all of this well, then I'm in favor. I view those things as being:

  1. Add an option for the scheduler to subscribe
  2. Add buffers on the scheduler side
  3. Verify that we're not significantly slower in the case where there is just one publisher and no subscribers (other than the scheduler).
  4. Verify that if we send msgpack serializable data to the scheduler that it's easy to read there

@mrocklin
Member

Maybe the original idea of "make it easy for clients to subscribe to a feed of events from the scheduler on log_event" was overly ambitious.

We already have a mechanism for the scheduler to send updates to the client. Maybe there should be a Client._stream_handler that is "warn" or "print" that just issues a warning or prints something. That might be a good general tool to have, and it's easy to build on what we already use day-to-day.

Then in Scheduler.log_event we track a few names and issue warnings or print statements down to the client in those cases.
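For illustration, a sketch of the client half of that idea (the op names, message shape, and registration details are assumptions, not the existing protocol):

```python
import warnings


def handle_warn(message=None, **kwargs):
    # Hypothetical stream handler: the scheduler would send
    # {"op": "warn", "message": "..."} down the existing client stream
    warnings.warn(str(message))


def handle_print(message=None, **kwargs):
    # Hypothetical counterpart that simply echoes scheduler messages
    print(message)


# Wiring sketch: register next to the client's existing stream handlers,
# e.g. client._stream_handlers.update({"warn": handle_warn,
#                                      "print": handle_print}),
# and have Scheduler.log_event forward selected topics as such messages.
```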

@fjetter
Member Author

fjetter commented Aug 16, 2021

Client._stream_handler that is "warn" or "print" that just issues a warning or prints something. That might be a good general tool to have, and it's easy to build on what we already use day-to-day.

I think what I have over in #5217 is not much more complicated but comes from a different point of view: instead of logging, I went for a handler pattern. I wanted to avoid any kind of overly verbose logging.


FWIW, getting the pubsub thing to work is partially done in #5218; there are still a few open questions, though:

Add an option for the scheduler to subscribe

I special-cased this with a boolean toggle so that events go directly to the event deques.

Add buffers on the scheduler side

What kind of buffers do you mean?

Verify that we're not significantly slower in the case where there is just one publisher and no subscribers (other than the scheduler).

I'd be surprised to see a performance impact since, regardless of the implementation, there will be one msg from worker to scheduler (assuming no more subscribers).
It will make a dent in scheduler performance once the client subscribes, since we'll need to forward msgs there, but this is one of the reasons why I don't want to blindly forward everything.

Verify that if we send msgpack serializable data to the scheduler that it's easy to read there

That's one of my current roadblocks, mostly since I do not know what the intention is here and/or I encountered a bug (to_serialize({"my": "message"}) breaks the remote end, but that's either a minor bug or some missing plumbing on my side).

@fjetter
Member Author

fjetter commented Sep 8, 2021

Another possibility, quicker and less sophisticated, would be to register log handlers and keep tornado. The log handlers would only trigger on exception-level logs and would call into the log_event machinery of #5217.
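In sketch form (the handler class and the wiring are invented for illustration; log_event as proposed in #5217):

```python
import logging


class LogEventHandler(logging.Handler):
    """Forward ERROR-and-above log records to the server's log_event."""

    def __init__(self, server):
        super().__init__(level=logging.ERROR)
        self.server = server

    def emit(self, record):
        # Hypothetical: server.log_event(topic, msg) as proposed in #5217
        self.server.log_event(
            "error-logs",
            {
                "logger": record.name,
                "message": record.getMessage(),
                "exception": repr(record.exc_info[1]) if record.exc_info else None,
            },
        )


# Wiring sketch: attach to the distributed loggers on server start, e.g.
# logging.getLogger("distributed").addHandler(LogEventHandler(server))
```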
