
Scheduler plugin to dump cluster state on close #5660

Closed
gjoseph92 opened this issue Jan 13, 2022 · 5 comments · Fixed by #5983
@gjoseph92 (Collaborator)

We've run into issues where the scheduler unexpectedly, but cleanly, shuts itself down after running for a very long time. Having a dump of cluster state would help debug this.

After #5659 is implemented, write a SchedulerPlugin with a close hook that dumps cluster state. The filename where the state is written can be passed into the plugin instance.

If the cluster state dump fails, or writing to the destination fails, this should not affect the shutdown process—just log the problem and move on.
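A minimal sketch of what such a plugin could look like. This is illustrative, not the actual distributed API: the real class would subclass `distributed.SchedulerPlugin`, and `get_state` here is a placeholder for whatever dump API #5659 ends up providing.

```python
# Hedged sketch (not the real distributed API): a SchedulerPlugin-style
# object whose close() dumps cluster state to a path given at construction,
# and never lets a failed dump interfere with shutdown.
import json
import logging

logger = logging.getLogger(__name__)


class DumpClusterStateOnClose:
    """Illustrative stand-in for a distributed.SchedulerPlugin subclass."""

    def __init__(self, path, get_state):
        self.path = path          # destination file, passed into the plugin
        self.get_state = get_state  # placeholder for the #5659 dump API

    async def close(self):
        try:
            state = await self.get_state()
            with open(self.path, "w") as f:
                json.dump(state, f)
        except Exception:
            # Per the issue: if dumping or writing fails, just log the
            # problem and move on; shutdown must not be affected.
            logger.exception("Failed to dump cluster state to %s", self.path)
```

The broad `except Exception` plus `logger.exception` is deliberate: the whole point is that a failed dump is a diagnostic loss, not a shutdown error.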

@gjoseph92 gjoseph92 added diagnostics good second issue Clearly described, educational, but less trivial than "good first issue". labels Jan 13, 2022
@gjoseph92 (Collaborator, Author)

Actually, a SchedulerPlugin may not work for debugging #5675. The close callback on plugins doesn't run until after the workers have already been removed:

```python
async def close(self):
    """Run when the scheduler closes down

    This runs at the beginning of the Scheduler shutdown process, but after
    workers have been asked to shut down gracefully
    """
```

We could, however, use a preload, since teardown() gets called before any workers are removed:

```python
for preload in self.preloads:
    await preload.teardown()

if close_workers:
    await self.broadcast(msg={"op": "close_gracefully"}, nanny=True)
```

A preload will be a bit less ergonomic to install, though.
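A sketch of the preload approach, under the assumption that the scheduler preload module defines a `dask_teardown` function (the convention distributed preloads use for setup/teardown hooks). `dump_state` is a placeholder for the API from #5659, and `DUMP_PATH` is an assumed destination, not a real config key.

```python
# Hedged sketch of a scheduler preload module. Per the close() excerpt
# above, preload teardown runs before any workers are removed, which is
# what makes this approach viable. `dump_state` and DUMP_PATH are
# placeholders, not real distributed names.
import json
import logging

logger = logging.getLogger(__name__)

DUMP_PATH = "/tmp/scheduler-state-dump.json"  # assumed destination


async def dask_teardown(scheduler):
    try:
        state = scheduler.dump_state()  # placeholder for the #5659 API
        with open(DUMP_PATH, "w") as f:
            json.dump(state, f)
    except Exception:
        # Same rule as the plugin: never block shutdown on a failed dump.
        logger.exception("Cluster state dump failed; continuing shutdown")
```

Installing this requires pointing the scheduler at the module (e.g. via the preload CLI/config mechanism), which is the ergonomic cost mentioned below.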

@ian-r-rose (Collaborator)

> Actually, a SchedulerPlugin may not work for debugging #5675. The close callback on plugins doesn't run until after the workers have already been removed:

We could also add a before_close() plugin hook. I've also run into this issue before in a different context, and adding some more granularity to plugin lifecycles is pretty cheap.
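One way the proposed hook could be dispatched, sketched under the assumption that `Scheduler.close()` would iterate its plugins before removing any workers. The helper name and hook signature are illustrative, not real distributed internals.

```python
# Hedged sketch of the proposed before_close dispatch: call an optional
# before_close() on each plugin before workers are removed. Supports both
# sync and async hooks; names are illustrative only.
import inspect


async def run_before_close_hooks(plugins):
    for plugin in plugins:
        hook = getattr(plugin, "before_close", None)
        if hook is None:
            continue  # hook is optional, so old plugins keep working
        result = hook()
        if inspect.isawaitable(result):
            await result
```

Making the hook optional via `getattr` keeps the change backward compatible, which is part of why adding lifecycle granularity is cheap.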

@jrbourbeau (Member)

Yeah, if a hook that runs before workers have been removed would be useful for other problems, then adding a new method seems sensible.

@gjoseph92 (Collaborator, Author)

Sounds good. Looking back through the git blame, it seems that scheduler plugin behavior has always worked this way, but without much explanation of why. Having this option would probably be useful in many cases (especially if you're doing something dangerous, have worker plugins doing some stateful thing, and want to communicate with your worker plugins before shutdown).

@gjoseph92 (Collaborator, Author)

Hm, I don't know what I was talking about here.

> This runs at the beginning of the Scheduler shutdown process, but after workers have been asked to shut down gracefully

is only the case if close_workers=True is passed to close(). That doesn't happen when close is invoked from the idle timeout (I don't know why; I feel like close_workers should be true if the cluster is shutting down due to idleness). It seems to only ever be set to True when Client.shutdown() is called and the client doesn't have a cluster object attached.

So though I'm not sure whether close_workers actually should be True for the idle timeout, for the purposes of debugging #5675 we don't need a before_close hook or anything like that.
