
Close worker when we discover an internal error #6206

Closed · mrocklin opened this issue Apr 26, 2022 · 5 comments · Fixed by #6210
@mrocklin (Member) commented Apr 26, 2022

There have been a few instances where a worker's internal state has been discovered to be inconsistent. This might be a validation error or an invalid transition or some other internal error. We should resolve these and determine what is going on. In the meantime it's also critical that we provide a stable experience to users. These situations are rare enough that it probably makes sense to just close down the worker and have it be restarted, and hope that the scheduler logic (which appears to be in a better state today) cleans things up properly.

This doesn't solve the underlying problem, but does give us a little space and gives users a better experience while we discover and resolve that problem.
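
A minimal sketch of the idea, assuming a hypothetical `InternalStateError` raised by state validation or transition checks (the exception name and the details of the close call are illustrative, not distributed's actual implementation):

```python
import logging

logger = logging.getLogger(__name__)


class InternalStateError(Exception):
    """Hypothetical: raised when validation or a transition finds an
    inconsistency in the worker's internal state."""


async def on_internal_error(worker, exc: InternalStateError) -> None:
    # Log loudly first, then shut the worker down rather than continuing to
    # run with corrupted state; the scheduler reschedules the worker's tasks.
    logger.exception("Inconsistent internal state; closing worker %s", worker.address)
    await worker.close(nanny=False)  # nanny=False leaves the nanny process running
```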

@fjetter (Member) commented Apr 26, 2022

+1

I like failing hard and loud. Even after stabilizing, I think this is a good way to go: if an internal exception occurs, we have no way of resolving it other than restarting the worker. We should ensure we leave enough logging information, if possible.

@mrocklin (Member, Author) commented

Yeah, I'll probably send a packet of information to the scheduler using the eventing system, like what we do with invalid transitions and task states. We should probably also consider shipping those events to the client by default.
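
For context, a sketch of what that could look like with distributed's existing event system (`Worker.log_event` and `Client.get_events` are real APIs; the topic name and payload shape here are only illustrative):

```python
def report_internal_error(worker, exc: Exception) -> None:
    # Ship a small diagnostic packet to the scheduler before closing the
    # worker, so the failure is visible even after the worker is gone.
    worker.log_event(
        "worker-internal-error",  # illustrative topic name
        {
            "worker": worker.address,
            "exception": repr(exc),
        },
    )
```

A client could then pull these back with `client.get_events("worker-internal-error")` to see why a worker shut itself down.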

@mrocklin (Member, Author) commented

POC raised here for broad conceptual feedback: #6210

@gjoseph92 (Collaborator) commented

xref #6201 for why this is a little tricky, and why it isn't happening already. It's awkward that:

  • you have to explicitly mark which coroutines should have this error-propagation behavior (see the sketch after this comment)
  • I don't think Worker.close() will actually cancel other running coroutines; they'll just eventually get leaked? (Or error out as the things they were relying on, like RPCs, shut down.)

But getting closer to more structured behavior is still a good step.
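
One way to picture that "explicit marking" is a decorator, sketched below; the `fail_hard` name and interface here are hypothetical, not necessarily what #6210 does. Any unhandled exception escaping a decorated coroutine closes the worker instead of being silently dropped:

```python
import functools
import logging

logger = logging.getLogger(__name__)


def fail_hard(method):
    """Close the worker if an unhandled exception escapes ``method``.

    Illustrative sketch; coroutines must opt in by being decorated.
    """

    @functools.wraps(method)
    async def wrapper(self, *args, **kwargs):
        try:
            return await method(self, *args, **kwargs)
        except Exception:
            logger.exception("Internal error in %s; closing worker", method.__name__)
            await self.close(nanny=False)
            raise

    return wrapper
```

A coroutine such as a scheduler-handling loop would then opt in with `@fail_hard`; anything left undecorated keeps the old behavior, which is the awkwardness noted above.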

@mrocklin (Member, Author) commented

> you have to explicitly mark which coroutines should have this error-propagation behavior

I agree that this is not optimal. I also think that it's fine, though.

> I don't think Worker.close() will actually cancel other running coroutines; they'll just eventually get leaked?

I don't think leaking running coroutines is critical. My guess is that calling worker.close will cause the worker to close fairly reliably.

> But getting closer to more structured behavior is still a good step

Yeah, to be clear, I don't mean to say "this will solve all of our problems all of the time". I'm saying "this seems like a really easy and possibly large win for us".

I'm happy that folks are thinking long-term, but I also want to make sure that we get to a happy place relatively quickly.
