
Close worker when we discover an internal error #6206

Closed · mrocklin opened this issue Apr 26, 2022 · 5 comments · Fixed by #6210
@mrocklin (Member) commented Apr 26, 2022

There have been a few instances where a worker's internal state has been discovered to be inconsistent. This might be a validation error or an invalid transition or some other internal error. We should resolve these and determine what is going on. In the meantime it's also critical that we provide a stable experience to users. These situations are rare enough that it probably makes sense to just close down the worker and have it be restarted, and hope that the scheduler logic (which appears to be in a better state today) cleans things up properly.

This doesn't solve the underlying problem, but does give us a little space and gives users a better experience while we discover and resolve that problem.
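
A minimal sketch of the idea, assuming a hypothetical `InternalStateError` raised by state validation or transition checks (the exception name and the details of the close call are illustrative, not distributed's actual implementation):

```python
import logging

logger = logging.getLogger(__name__)


class InternalStateError(Exception):
    """Hypothetical: raised when validation or a transition finds an
    inconsistency in the worker's internal state."""


async def on_internal_error(worker, exc: InternalStateError) -> None:
    # Log loudly first, then shut the worker down rather than continuing to
    # run with corrupted state; the scheduler reschedules the worker's tasks.
    logger.exception("Inconsistent internal state; closing worker %s", worker.address)
    await worker.close(nanny=False)  # nanny=False leaves the nanny process running
```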

@fjetter (Member) commented Apr 26, 2022

+1

I like failing hard and loud. Even after stabilizing, I think this is a good way to go: if an internal exception occurs, we have no way of resolving it other than restarting the worker. We should ensure we leave enough logging information, if possible.

@mrocklin (Member, Author) commented

Yeah, I'll probably send a packet of information to the scheduler using the eventing system, like what we do with invalid transitions and task states. We should probably also consider shipping those events to the client by default.
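
For context, a sketch of what that could look like with distributed's existing event system (`Worker.log_event` and `Client.get_events` are real APIs; the topic name and payload shape here are only illustrative):

```python
def report_internal_error(worker, exc: Exception) -> None:
    # Ship a small diagnostic packet to the scheduler before closing the
    # worker, so the failure is visible even after the worker is gone.
    worker.log_event(
        "worker-internal-error",  # illustrative topic name
        {
            "worker": worker.address,
            "exception": repr(exc),
        },
    )
```

A client could then pull these back with `client.get_events("worker-internal-error")` to see why a worker shut itself down.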

@mrocklin (Member, Author) commented

POC raised here for broad conceptual feedback: #6210

@gjoseph92 (Collaborator) commented

xref #6201 for why this is a little tricky, and why it isn't happening already. It's awkward that:

  • you have to explicitly mark which coroutines should have this error-propagation behavior (see the sketch after this comment)
  • I don't think Worker.close() will actually cancel other running coroutines; they'll just eventually get leaked? (Or error out as the things they were relying on, like RPCs, shut down.)

But getting closer to more structured behavior is still a good step.
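
One way to picture that "explicit marking" is a decorator, sketched below; the `fail_hard` name and interface here are hypothetical, not necessarily what #6210 does. Any unhandled exception escaping a decorated coroutine closes the worker instead of being silently dropped:

```python
import functools
import logging

logger = logging.getLogger(__name__)


def fail_hard(method):
    """Close the worker if an unhandled exception escapes ``method``.

    Illustrative sketch; coroutines must opt in by being decorated.
    """

    @functools.wraps(method)
    async def wrapper(self, *args, **kwargs):
        try:
            return await method(self, *args, **kwargs)
        except Exception:
            logger.exception("Internal error in %s; closing worker", method.__name__)
            await self.close(nanny=False)
            raise

    return wrapper
```

A coroutine such as a scheduler-handling loop would then opt in with `@fail_hard`; anything left undecorated keeps the old behavior, which is the awkwardness noted above.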

@mrocklin (Member, Author) commented

> you have to explicitly mark which coroutines should have this error-propagation behavior

I agree that this is not optimal. I also think that it's fine, though.

> I don't think Worker.close() will actually cancel other running coroutines; they'll just eventually get leaked?

I don't think leaking running coroutines is critical. My guess is that calling worker.close will cause the worker to close fairly reliably.

> But getting closer to more structured behavior is still a good step

Yeah, to be clear, I don't mean to say "this will solve all of our problems all of the time". I'm saying "this seems like a really easy and possibly large win for us".

I'm happy that folks are thinking long-term, but I also want to make sure that we get to a happy place relatively quickly.
