Use KernelManager's graceful shutdown rather than KILLing kernels #64
Conversation
LGTM.
nbclient/client.py
Outdated
    if 'No kernel is running!' not in str(e):
        raise
# Queue the manager to kill the process, and recover gracefully if it's already dead.
await ensure_async(self.km.shutdown_kernel(now=now))
I think this will be ok. I would make sure you test the parallel execution and ensure that A) the graceful shutdown degrades to ungraceful if it takes too long and B) the kernels shut down without zombie processes left over when the graceful exit has to revert to ungraceful. There used to be a lot of issues with graceful kernel shutdown in headless executions, as they weren't always cleaning up before the parent process exited, e.g. with a papermill CLI call.
If you can do some quick testing with parallel processing and create a situation where the kernel won't exit cleanly (e.g. the kernel is busy executing something else), that'd be ideal before merging.
I've verified the graceful shutdown defaults to ungraceful by using a notebook with an atexit hook that sleeps, and also a thread that sleeps and doesn't join. Both cause the ungraceful SIGKILL. In neither case were there zombie processes; that said, if the notebook itself was starting up multiple processes and not cleaning them up properly, you could end up with zombies, but that's always been the case.
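For reference, the kind of notebook cell I used to force the slow exit looks roughly like this (the specific sleeps are just illustrative):

import atexit
import threading
import time

# atexit hook that stalls interpreter shutdown well past the graceful timeout
atexit.register(lambda: time.sleep(120))

# non-daemon thread that never finishes, so the kernel process can't exit cleanly
threading.Thread(target=lambda: time.sleep(120)).start()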
If a papermill CLI call or a direct nbclient call is made and gets interrupted, I don't believe the exit is graceful. I thought the context manager would handle this, but it seems not. I'll look into why that is; on KeyboardInterrupt, it appears the finally block isn't reached.
Not sure what "parallel execution" means. Could you please explain what you meant there?
TIL asyncio and signal handling don't go together easily
Thanks for doing some testing here. By parallel execution I mean something like:
import multiprocessing

import papermill as pm  # assumed import, not in the original snippet

input_path = 'input.ipynb'    # placeholder paths for illustration
output_path = 'output.ipynb'

def worker():
    # or just nbclient directly here tbh
    return pm.execute_notebook(input_path, output_path)

if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker)
        jobs.append(p)
        p.start()
    for job in jobs:
        job.join()
where there's a subprocess boundary and some parallelism. There have been lots of issues here around cleanup and termination, so I wanted to double-check it's alright.
In the example, what would you worry about going wrong? If the notebook client isn't being shared, there shouldn't be any issue. If all you're asking is that I test that out, no problem, but I can't really see there being issues there if nothing gets interrupted or killed or whatever.
IMO, nbclient should guarantee an attempt is made to shut down kernels when INTed or TERMed, which is what I'm looking into currently. The mashup of async and non-async code is making this pretty tricky, and one problem is that all the cleanup code requires the ioloop, which is stopped when either of those signals is received.
🤦 I've been mucking around with signals, but really atexit should do the job. But then I guess it would need to be in the main thread/process; also, iirc atexit doesn't play well with multiprocessing.
TBH this way does have a lot of limitations. I'd love to hear your opinion here.
You're hitting all the reasons I didn't leave it graceful, unfortunately. atexit is best effort and does not run for subprocess calls. Testing parallel subworkflows is really about testing process exit patterns against the os._exit call multiprocessing makes (which skips atexit). In Python < 3.6 there were some rough edges that I remember being problematic under certain conditions as well. https://github.com/kuralabs/multiexit was an attempt to fix this but seems incomplete / only usable if you're guaranteed to be the entry process.
For async cleanup, maybe take a look at https://mail.python.org/pipermail/async-sig/2017-August/000374.html and see if that can be utilized here. It might simplify the problem substantially if so. Also, it's ok if there's some extreme case where we can't clean up -- you can't protect against things like OOM or SIGKILL -- but making an effort to reduce that to the fewest cases possible is worth it.
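Very roughly, the idea from that thread is to hook the signals at the event loop level so cancellation (and therefore finally blocks) happens inside the loop; a minimal sketch with placeholder names, not nbclient's actual code:

import asyncio
import signal

async def do_work():
    try:
        await asyncio.sleep(3600)  # stands in for notebook execution
    finally:
        print("kernel cleanup would go here")  # this finally actually runs on SIGINT/SIGTERM

def main():
    loop = asyncio.new_event_loop()
    task = loop.create_task(do_work())
    for sig in (signal.SIGINT, signal.SIGTERM):
        # cancel the task from inside the loop rather than letting the raw signal
        # tear the loop down without running finally blocks
        loop.add_signal_handler(sig, task.cancel)
    try:
        loop.run_until_complete(task)
    except asyncio.CancelledError:
        pass
    finally:
        loop.close()

if __name__ == '__main__':
    main()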
Yeah atexit is not a solution in case of multiple processes.
I had actually tried what's in that email for a while before giving up temporarily. What was happening was that the signal was somehow not being propagated after the signal handler was called, and I ended up with some weird exit behaviour, like cleanup running while nbclient was still executing the notebook, which caused the execution to fail because the channels were closed and something timed out. I would have expected the signal to cause that execution to exit.
I think the ideal way to handle this would be for the context manager to behave as usual and have the finally block be called when the program is interrupted or terminated. Why that doesn't happen isn't clear, but it appears to be a combination of async interacting strangely with KeyboardInterrupt and the run_sync decorator.
nbclient/client.py
Outdated
@@ -425,6 +424,33 @@ def setup_kernel(self, **kwargs):
        finally:
            self._cleanup_kernel()

    @contextmanager
    def graceful_shutdown(self):
This approach is a bit ham-fisted, but it does seem to work. So in cases of the friendly signals, cleanup should happen. Anything else is unfortunately impossible to handle, and I can't imagine the old way of cleanup did that consistently (or at all).
What's annoying is that without the signal handler, the finally clause of async_setup_kernel doesn't get called, since apparently the interrupt goes straight to the ioloop and cancels all the tasks on it without raising exceptions inside functions (I'm learning more than I wanted to about asyncio... lol). But when the handler is used, it causes a separate exception in async_execute, which causes that finally clause to run, running cleanup twice. I can't catch the exception there since it's a generic concurrent.futures._base.CancelledError. There's probably a way around it, but idk enough about how asyncio works, and running cleanup multiple times is harmless honestly.
I'm only using atexit here since Windows doesn't do signals (may have to tweak this if the signal handler function doesn't even exist).
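For clarity, the shape of what I'm doing is roughly this; a sketch rather than the exact code in the diff, with cleanup standing in for the real kernel cleanup call:

import atexit
import signal
from contextlib import contextmanager

@contextmanager
def graceful_shutdown_sketch(cleanup):
    # install handlers for the "friendly" signals so cleanup runs when we're
    # INTed/TERMed; atexit is the best-effort fallback (e.g. on Windows)
    def handler(signum, frame):
        cleanup()
        raise KeyboardInterrupt  # hypothetical: surface the interruption to the caller

    atexit.register(cleanup)
    previous = {}
    for sig in (signal.SIGINT, signal.SIGTERM):
        try:
            previous[sig] = signal.signal(sig, handler)
        except (ValueError, AttributeError):
            # not in the main thread, or the signal doesn't exist on this platform
            pass
    try:
        yield
    finally:
        cleanup()  # normal exit path; running cleanup twice is assumed harmless
        for sig, old in previous.items():
            signal.signal(sig, old)
        atexit.unregister(cleanup)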
Yeah I agree that the double cleanup should be fairly harmless as the cleanup functions become no-ops.
Thanks so much for digging deep into this. Checking these conditions thoroughly is appreciated!
Not exactly a no-op, since the shutdown message is still sent, but that's not really worth worrying about IMO.
nbclient/client.py
Outdated
try:
    yield
finally:
    if cleanup_kc:
Probably can just move this into the graceful shutdown context wrapper now?
good point. tbh I don't see a reason for graceful shutdown to be a separate function.
I'm also intentionally not respecting cleanup_kc in case of unexpected shutdown.
@golf-player looks like some code path is still hanging indefinitely in test_kernel_death. We were looking at doing a 0.4 release and this would be good to have in it imo. Do you need help testing this or do you have an idea of what's going wrong?
Yeah I saw that; I'll have some time to check it out tonight. Will get back to you.
Any plans for when the release will be made? Also, is there a branch I can look at to make sure there's no conflicts?
There is one conflict -- I think all the major changes are merged to master except this one.
OK I see the issue. In this test, the kernel is said to be dead, and the manager code has some annoying behaviour which I guess I'll ask about in that project. Basically here: https://github.com/jupyter/jupyter_client/blob/6.1.3/jupyter_client/manager.py#L331, and here: https://github.com/jupyter/jupyter_client/blob/6.1.3/jupyter_client/manager.py#L561, the finish_shutdown function doesn't set a timeout on the kernel.wait() call, which it should. In the offending test, kernel.wait() blocked everything forever waiting on an infinite loop, which IMO shouldn't happen; a kill signal should be sent to the kernel process instead.
I can get around this by only shutting down the km if it's alive, but I'd prefer to make the behaviour better in jupyter_client. I'll do that for now though, since it's a harmless thing to add, and I don't think changing the test makes much sense either.
edit: on looking at the code again, the code is fine, and I don't believe the scenario in the test can be reproduced
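For reference, the "only shut down if alive" guard mentioned above is just a couple of lines (a sketch against the diff earlier in the thread; is_alive() is the existing KernelManager method):

# Only ask the manager for a graceful shutdown if the kernel still reports as
# alive; otherwise finish_shutdown()'s kernel.wait() (no timeout in 6.1.3) can
# block forever on a process that never exits.
if await ensure_async(self.km.is_alive()):
    await ensure_async(self.km.shutdown_kernel(now=now))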
Seems the Windows test is broken everywhere, but otherwise tests look good.
Thanks! Let's get this merged and keep an eye out for any odd issues that may pop up with the better management changes here.
Thanks for the help and review @MSeal and @kevin-bates
I think the way it was implemented was possibly because of how earlier KernelManager versions behaved, but since jupyter_client is pinned to >6.1, it should be fine to leverage the graceful shutdown methods available.
KernelManager's graceful shutdown first does the equivalent of KernelClient.shutdown(), i.e. sending a shutdown_request message to the kernel. Then it polls the kernel to see if it's still alive, with a configurable timeout after which it force KILLs the kernel. So the behaviour here shouldn't really change beyond the extra wait time for things to shut down gracefully (which possibly fixes things that had been breaking).
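In rough pseudocode, that sequence looks like this (a simplified sketch of jupyter_client 6.1.x behaviour, not its actual code):

# What KernelManager.shutdown_kernel(now=False) boils down to, heavily simplified:
def shutdown_kernel_sketch(km, now=False):
    if now:
        km._kill_kernel()  # immediate SIGKILL, i.e. the old nbclient behaviour
        return
    km.request_shutdown()  # send the shutdown_request message to the kernel
    # poll for the process to exit, up to the configurable shutdown_wait_time;
    # if it's still alive after that, finish_shutdown falls back to _kill_kernel()
    km.finish_shutdown(waittime=km.shutdown_wait_time)
    km.cleanup()  # release ports / the connection file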