Use KernelManager's graceful shutdown rather than KILLing kernels #64

Merged: 9 commits into jupyter:master, May 29, 2020

Conversation

golf-player (Contributor)

I think the way it was implemented was possibly because of how earlier KernelManager versions behaved, but since jupyter_client is pinned to >6.1, it should be fine to leverage the graceful shutdown methods that are now available.

KernelManager's graceful shutdown first does the equivalent of KernelClient.shutdown(), i.e. it sends a shutdown message to the kernel. It then polls the kernel to see whether it's still alive, with a configurable timeout after which it force-kills the kernel. So the behaviour here shouldn't really change, apart from the extra time spent waiting for things to shut down gracefully (which possibly fixes things that had been breaking).
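
For reference, the caller-side flow is roughly the following (a sketch against jupyter_client >= 6.1, not nbclient's actual code; shutdown_wait_time is the KernelManager trait that controls the grace period):

from jupyter_client import KernelManager

km = KernelManager(kernel_name='python3')
km.start_kernel()

# Grace period before shutdown falls back to killing the kernel process.
km.shutdown_wait_time = 5.0

# now=False sends a shutdown_request first and polls; the kernel is only
# force-killed if it hasn't exited when the grace period runs out.
# now=True skips the polite request and kills immediately.
km.shutdown_kernel(now=False)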

@davidbrochart (Member)

LGTM.

Review thread on nbclient/client.py:
if 'No kernel is running!' not in str(e):
    raise
# Queue the manager to kill the process, and recover gracefully if it's already dead.
await ensure_async(self.km.shutdown_kernel(now=now))
Contributor

I think this will be ok. I would make sure you test parallel execution and ensure that A) the graceful shutdown degrades to ungraceful if it takes too long, and B) the kernels shut down without zombie processes left over when the graceful exit has to revert to ungraceful. There used to be a lot of issues with graceful kernel shutdown in headless executions, as kernels weren't always cleaning up before the parent process exited, e.g. with a papermill CLI call.

Contributor

If you can do some quick testing with parallel processing and set up a situation where the kernel won't exit cleanly (e.g. the kernel is busy executing something else), that'd be ideal before merging.

@golf-player (Contributor Author) commented May 21, 2020

I've verified that the graceful shutdown falls back to ungraceful by using a notebook with an atexit handler that sleeps, and also with a thread that sleeps and doesn't join. Both cause the ungraceful SIGKILL. In neither case were there zombie processes; that said, if the notebook itself was starting up multiple processes and preventing them from cleaning up properly, you could end up with zombies, but that's always been the case.
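
For context, the stubborn notebook content was along these lines (a rough sketch of the idea, not the exact notebook I used):

import atexit
import threading
import time

# An atexit handler that won't return promptly, so a polite shutdown_request
# can't complete within the grace period.
atexit.register(lambda: time.sleep(600))

# A non-daemon thread that keeps sleeping and is never joined, which also
# keeps the kernel process from exiting cleanly.
threading.Thread(target=lambda: time.sleep(600)).start()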

If a papermill CLI call or a direct nbclient call is made and gets interrupted, I don't believe the exit is graceful. I thought the context manager would handle this, but it seems not; I'll look into why that is. On KeyboardInterrupt, the finally block isn't reached, it appears.

Not sure what "parallel execution" means. Could you please explain what you meant there?

@golf-player (Contributor Author) commented May 21, 2020

TIL asyncio and signal handling don't go together easily

Contributor

Thanks for doing some testing here. By parallel execution I mean something like:

import multiprocessing

import papermill as pm


def worker():
    # or just call nbclient directly here tbh
    # input_path / output_path stand in for a real notebook and output file
    return pm.execute_notebook(input_path, output_path)


if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker)
        jobs.append(p)
        p.start()
    for job in jobs:
        job.join()

where there's a subprocess boundary and some parallelism. There have been lots of issues here around cleanup and termination, so I wanted to double-check it's alright.

@golf-player (Contributor Author)

In the example, what would you worry about going wrong? If the notebook client isn't being shared, there shouldn't be any issue. If all you're asking is that I test that out, no problem, but I can't really see there being issues there if nothing gets interrupted or killed or whatever.

IMO, nbclient should guarantee that an attempt is made to shut down kernels when it is INTed or TERMed, which is what I'm looking into currently. The mashup of async and non-async code is making this pretty tricky, and one problem is that all the cleanup code requires the ioloop, which is stopped when either of those signals is received.

@golf-player (Contributor Author) commented May 21, 2020

🤦 I've been mucking around with signals, but really atexit should do the job. But then I guess it would need to be in the main thread/process; also, IIRC atexit doesn't play well with multiprocessing.

TBH this way does have a lot of limitations. I'd love to hear your opinion here.

Contributor

You're hitting all the reasons I didn't leave it graceful, unfortunately. atexit is best effort and does not run for subprocess calls. Testing parallel subworkflows is really about testing process exit patterns against the os._exit call it makes (which skips atexit). In Python < 3.6 there were some rough edges that I remember being problematic under certain conditions as well. https://github.com/kuralabs/multiexit was an attempt to fix this but seems incomplete / only usable if you're guaranteed to be the entry process.

For async cleanup, maybe take a look at https://mail.python.org/pipermail/async-sig/2017-August/000374.html and see if that can be utilized here. It might simplify the problem substantially if so. Also, it's ok if there's some extreme case where we can't clean up (you can't protect against things like OOM or SIGKILL), but making an effort to reduce that to the fewest cases possible is worth it.
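
The shape of that pattern is roughly the following (a hypothetical sketch with made-up names, just to show how the signals get routed into the running loop so a cleanup coroutine has a chance to run):

import asyncio
import signal


async def execute_notebook():
    # stand-in for the actual nbclient execution coroutine
    await asyncio.sleep(3600)


async def cleanup_kernel():
    # stand-in for gracefully shutting the kernel down
    pass


async def main():
    loop = asyncio.get_event_loop()
    task = asyncio.ensure_future(execute_notebook())
    # Cancel the execution task on SIGINT/SIGTERM instead of letting the
    # signal tear the loop down before cleanup runs (POSIX-only API).
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, task.cancel)
    try:
        await task
    except asyncio.CancelledError:
        pass
    finally:
        await cleanup_kernel()


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())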

@golf-player (Contributor Author)

Yeah atexit is not a solution in case of multiple processes.

I had actually tried what's in that email for a while before giving up temporarily. What was happening was that the signal was somehow not being propagated after the signal handler was called, and it ended up with some weird exit behaviour, like cleanup running while nbclient was still executing the notebook, which caused the execution to fail because the channels were closed and something timed out. I would have expected the signal to cause that execution to exit.

I think the ideal way to handle this would be for the context manager to behave as usual and have the finally block called when the program is interrupted or terminated. The root cause there isn't clear, but it appears to be a combination of asyncio interacting strangely with KeyboardInterrupt and the run_sync decorator.

@@ -425,6 +424,33 @@ def setup_kernel(self, **kwargs):
        finally:
            self._cleanup_kernel()

    @contextmanager
    def graceful_shutdown(self):
@golf-player (Contributor Author) commented May 22, 2020

This approach is a bit ham-fisted, but it does seem to work. So in the case of the friendly signals (SIGINT/SIGTERM), cleanup should happen. Anything else is unfortunately impossible to handle, and I can't imagine the old way of cleaning up did that consistently (or at all).

What's annoying is that without the signal handler, the finally clause of async_setup_kernel doesn't get called, since apparently the interrupt goes straight to the ioloop and cancels all the tasks on it without raising exceptions inside functions (I'm learning more than I wanted to about asyncio... lol). But when the handler is used, it causes a separate exception in async_execute, which causes that finally clause to run as well, so cleanup runs twice. I can't catch the exception there since it's a generic concurrent.futures._base.CancelledError. There's probably a way around it, but I don't know enough about how asyncio works, and running cleanup multiple times is harmless, honestly.

I'm only using atexit here since Windows doesn't do signals (may have to tweak this if the signal handler function doesn't even exist).
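
For illustration, the shape of it is roughly the following (a simplified sketch of the idea, not the exact diff in this PR):

import atexit
import signal
from contextlib import contextmanager


@contextmanager
def graceful_shutdown(cleanup):
    """Best effort: run cleanup() on normal exit, SIGINT, or SIGTERM."""

    def handler(signum, frame):
        # Make an attempt at kernel cleanup, then let the process exit.
        cleanup()
        raise SystemExit(128 + signum)

    # atexit covers the cases (e.g. Windows) where the signal wiring doesn't apply.
    atexit.register(cleanup)
    try:
        signal.signal(signal.SIGINT, handler)
        signal.signal(signal.SIGTERM, handler)
    except ValueError:
        # signal.signal only works in the main thread; fall back to atexit alone.
        pass
    try:
        yield
    finally:
        # Running cleanup more than once is tolerated.
        cleanup()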

Contributor

Yeah I agree that the double cleanup should be fairly harmless as the cleanup functions become no-ops.

Thanks so much for digging deep into this. I appreciate you checking these conditions thoroughly!

@golf-player (Contributor Author)

Not exactly a no-op, since the shutdown message is still sent, but that's not really worth worrying about IMO.

try:
    yield
finally:
    if cleanup_kc:
Contributor

Probably can just move this into the graceful shutdown context wrapper now?

@golf-player (Contributor Author)

Good point. TBH I don't see a reason for graceful_shutdown to be a separate function.

@golf-player (Contributor Author)

I'm also intentionally not respecting cleanup_kc in case of unexpected shutdown.

@MSeal (Contributor) commented May 26, 2020

@golf-player looks like some code path is still hanging indefinitely in test_kernel_death. We were looking at doing a 0.4 release and this would be good to have in it imo. Do you need help testing this or do you have an idea of what's going wrong?

@golf-player (Contributor Author) commented May 26, 2020 via email

@MSeal (Contributor) commented May 26, 2020

There is one conflict; I think all the major changes are merged to master except this one.

@golf-player (Contributor Author) commented May 27, 2020

OK, I see the issue. In this test, the kernel is supposed to be dead, and the manager code has some annoying behaviour which I guess I'll ask about in that project.

Basically here: https://github.com/jupyter/jupyter_client/blob/6.1.3/jupyter_client/manager.py#L331, and here: https://github.com/jupyter/jupyter_client/blob/6.1.3/jupyter_client/manager.py#L561, the finish_shutdown function doesn't set a timeout on the kernel.wait() call, which it should. In the offending test, kernel.wait() blocked everything forever waiting on an infinite loop, which IMO shouldn't happen; a kill signal should be sent to the kernel process instead.

I can get around this by only shutting down the km if it's alive, but I'd prefer to make the behaviour better in jupyter_client. I'll do that for now, though, since it's a harmless thing to add, and I don't think changing the test makes much sense either.
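
Concretely, the guard would be something like this, in the same spot as the snippet quoted above (a sketch, not necessarily the final diff):

# Skip the manager shutdown entirely when the kernel process is already gone,
# so finish_shutdown never ends up waiting on a dead kernel.
if await ensure_async(self.km.is_alive()):
    await ensure_async(self.km.shutdown_kernel(now=now))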

Edit: on looking at the code again, it's fine, and I don't believe the scenario in the test can be reproduced.

@golf-player (Contributor Author)

Seems the Windows test is broken everywhere, but otherwise the tests look good.

@MSeal (Contributor) left a comment

Thanks! Let's get this merged and keep an eye out for any odd issues that may pop up with the better management changes here.

@MSeal merged commit 42d2c2c into jupyter:master on May 29, 2020
@golf-player (Contributor Author)

Thanks for the help and review @MSeal and @kevin-bates
