Use KernelManager's graceful shutdown rather than KILLing kernels #64
Conversation
LGTM.
nbclient/client.py
Outdated
    if 'No kernel is running!' not in str(e):
        raise
# Queue the manager to kill the process, and recover gracefully if it's already dead.
await ensure_async(self.km.shutdown_kernel(now=now))
I think this will be ok. I would make sure you test the parallel execution and ensure that A) the graceful shutdown degrades to ungraceful if it takes too long and B) the kernels shut down without zombie processes left over when the graceful exit has to revert to ungraceful. There used to be a lot of issues with graceful kernel shutdown in headless executions, as they weren't always cleaning up before the parent process exited, e.g. with a papermill CLI call.
If you can do some quick testing with parallel processing and create a situation where the kernel won't exit cleanly (e.g. the kernel is busy executing something else), that'd be ideal before merging.
I've verified the graceful shutdown defaults to ungraceful by using a notebook with an atexit hook that sleeps, and also a thread that sleeps and doesn't join. Both cause the ungraceful SIGKILL. In neither case were there zombie processes; that said, if the notebook itself was starting up multiple processes and not cleaning them up properly, you could end up with zombies, but that's always been the case.
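For reference, the kind of notebook cell I used to force the slow exit looks roughly like this (the specific sleeps are just illustrative):

import atexit
import threading
import time

# atexit hook that stalls interpreter shutdown well past the graceful timeout
atexit.register(lambda: time.sleep(120))

# non-daemon thread that never finishes, so the kernel process can't exit cleanly
threading.Thread(target=lambda: time.sleep(120)).start()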
If a papermill CLI call or a direct nbclient call is made and gets interrupted, I don't believe the exit is graceful. I thought the context manager would handle this, but it seems not. I'll look into why that is; on KeyboardInterrupt, it appears the finally block isn't reached.
Not sure what "parallel execution" means. Could you please explain what you meant there?
TIL asyncio and signal handling don't go together easily
Thanks for doing some testing here. By parallel execution I mean something like:
import multiprocessing

import papermill as pm  # assumed import, not in the original snippet

input_path = 'input.ipynb'    # placeholder paths for illustration
output_path = 'output.ipynb'

def worker():
    # or just nbclient directly here tbh
    return pm.execute_notebook(input_path, output_path)

if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker)
        jobs.append(p)
        p.start()
    for job in jobs:
        job.join()
where there's a subprocess boundary and some parallelism. There have been lots of issues here around cleanup and termination, so I wanted to double-check it's alright.
In the example, what would you worry about going wrong? If the notebook client isn't being shared, there shouldn't be any issue. If all you're asking is that I test that out, no problem, but I can't really see there being issues there if nothing gets interrupted or killed or whatever.
IMO, nbclient should guarantee an attempt is made to shut down kernels when INTed or TERMed, which is what I'm looking into currently. The mashup of async and non-async code is making this pretty tricky, and one problem is that all the cleanup code requires the ioloop, which is stopped when either of those signals is received.
🤦 I've been mucking around with signals, but really atexit should do the job. But then I guess it would need to be in the main thread/process; also, iirc atexit doesn't play well with multiprocessing.
TBH this way does have a lot of limitations. I'd love to hear your opinion here.
You're hitting all the reasons I didn't leave it graceful, unfortunately. atexit is best effort and does not run for subprocess calls. Testing parallel subworkflows is really about testing process exit patterns against the os._exit call multiprocessing makes (which skips atexit). In Python < 3.6 there were some rough edges that I remember being problematic under certain conditions as well. https://github.com/kuralabs/multiexit was an attempt to fix this but seems incomplete / only usable if you're guaranteed to be the entry process.
For async cleanup, maybe take a look at https://mail.python.org/pipermail/async-sig/2017-August/000374.html and see if that can be utilized here. It might simplify the problem substantially if so. Also, it's ok if there's some extreme case where we can't clean up -- you can't protect against things like OOM or SIGKILL -- but making an effort to reduce that to the fewest cases possible is worth it.
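Very roughly, the idea from that thread is to hook the signals at the event loop level so cancellation (and therefore finally blocks) happens inside the loop; a minimal sketch with placeholder names, not nbclient's actual code:

import asyncio
import signal

async def do_work():
    try:
        await asyncio.sleep(3600)  # stands in for notebook execution
    finally:
        print("kernel cleanup would go here")  # this finally actually runs on SIGINT/SIGTERM

def main():
    loop = asyncio.new_event_loop()
    task = loop.create_task(do_work())
    for sig in (signal.SIGINT, signal.SIGTERM):
        # cancel the task from inside the loop rather than letting the raw signal
        # tear the loop down without running finally blocks
        loop.add_signal_handler(sig, task.cancel)
    try:
        loop.run_until_complete(task)
    except asyncio.CancelledError:
        pass
    finally:
        loop.close()

if __name__ == '__main__':
    main()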
Yeah atexit is not a solution in case of multiple processes.
I had actually tried what's in that email for a while before giving up temporarily. What was happening was that the signal was somehow not being propagated after the signal handler was called, and I ended up with some weird exit behaviour, like cleanup running while nbclient was still executing the notebook, which caused the execution to fail because the channels were closed and something timed out. I would have expected the signal to cause that execution to exit.
I think the ideal way to handle this would be for the context manager to behave as usual and have the finally block be called when the program is interrupted or terminated. Why that doesn't happen isn't clear, but it appears to be a combination of async interacting strangely with KeyboardInterrupt and the run_sync decorator.
nbclient/client.py
Outdated
@@ -425,6 +424,33 @@ def setup_kernel(self, **kwargs):
        finally:
            self._cleanup_kernel()

    @contextmanager
    def graceful_shutdown(self):
This approach is a bit ham-fisted, but it does seem to work. So in cases of the friendly signals, cleanup should happen. Anything else is unfortunately impossible to handle, and I can't imagine the old way of cleanup did that consistently (or at all).
What's annoying is that without the signal handler, the finally clause of async_setup_kernel doesn't get called, since apparently the interrupt goes straight to the ioloop and cancels all the tasks on it without raising exceptions inside functions (I'm learning more than I wanted to about asyncio... lol). But when the handler is used, it causes a separate exception in async_execute, which causes that finally clause to run, running cleanup twice. I can't catch the exception there since it's a generic concurrent.futures._base.CancelledError. There's probably a way around it, but idk enough about how asyncio works, and running cleanup multiple times is harmless honestly.
I'm only using atexit here since Windows doesn't do signals (may have to tweak this if the signal handler function doesn't even exist).
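For clarity, the shape of what I'm doing is roughly this; a sketch rather than the exact code in the diff, with cleanup standing in for the real kernel cleanup call:

import atexit
import signal
from contextlib import contextmanager

@contextmanager
def graceful_shutdown_sketch(cleanup):
    # install handlers for the "friendly" signals so cleanup runs when we're
    # INTed/TERMed; atexit is the best-effort fallback (e.g. on Windows)
    def handler(signum, frame):
        cleanup()
        raise KeyboardInterrupt  # hypothetical: surface the interruption to the caller

    atexit.register(cleanup)
    previous = {}
    for sig in (signal.SIGINT, signal.SIGTERM):
        try:
            previous[sig] = signal.signal(sig, handler)
        except (ValueError, AttributeError):
            # not in the main thread, or the signal doesn't exist on this platform
            pass
    try:
        yield
    finally:
        cleanup()  # normal exit path; running cleanup twice is assumed harmless
        for sig, old in previous.items():
            signal.signal(sig, old)
        atexit.unregister(cleanup)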
Yeah I agree that the double cleanup should be fairly harmless as the cleanup functions become no-ops.
Thanks so much for digging deep into this. Checking these conditions thoroughly is appreciated!
Not exactly a no-op, since the shutdown message is still sent, but that's not really worth worrying about IMO.
nbclient/client.py
Outdated
try:
    yield
finally:
    if cleanup_kc:
Probably can just move this into the graceful shutdown context wrapper now?
good point. tbh I don't see a reason for graceful shutdown to be a separate function.
I'm also intentionally not respecting cleanup_kc in case of unexpected shutdown.
@golf-player looks like some code path is still hanging indefinitely in test_kernel_death. We were looking at doing a 0.4 release and this would be good to have in it imo. Do you need help testing this or do you have an idea of what's going wrong?
Yeah I saw that; I'll have some time to check it out tonight. Will get back to you.
Any plans for when the release will be made? Also, is there a branch I can look at to make sure there's no conflicts?
There is one conflict -- I think all the major changes are merged to master except this one.
OK I see the issue. In this test, the kernel is said to be dead, and the manager code has some annoying behaviour which I guess I'll ask about in that project. Basically here: https://github.com/jupyter/jupyter_client/blob/6.1.3/jupyter_client/manager.py#L331, and here: https://github.com/jupyter/jupyter_client/blob/6.1.3/jupyter_client/manager.py#L561, the finish_shutdown function doesn't set a timeout on the kernel.wait() call, which it should. In the offending test, kernel.wait() blocked everything forever waiting on an infinite loop, which IMO shouldn't happen; a kill signal should be sent to the kernel process instead.
I can get around this by only shutting down the km if it's alive, but I'd prefer to make the behaviour better in jupyter_client. I'll do that for now though, since it's a harmless thing to add, and I don't think changing the test makes much sense either.
edit: on looking at the code again, the code is fine, and I don't believe the scenario in the test can be reproduced
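For reference, the "only shut down if alive" guard mentioned above is just a couple of lines (a sketch against the diff earlier in the thread; is_alive() is the existing KernelManager method):

# Only ask the manager for a graceful shutdown if the kernel still reports as
# alive; otherwise finish_shutdown()'s kernel.wait() (no timeout in 6.1.3) can
# block forever on a process that never exits.
if await ensure_async(self.km.is_alive()):
    await ensure_async(self.km.shutdown_kernel(now=now))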
Seems the Windows test is broken everywhere, but otherwise tests look good.
Thanks! Let's get this merged and keep an eye out for any odd issues that may pop up with the better management changes here.
Thanks for the help and review @MSeal and @kevin-bates
I think the way it was implemented was possibly because of how earlier KernelManager versions behaved, but since jupyter_client is pinned to >6.1, it should be fine to leverage the graceful shutdown methods available.
KernelManager's graceful shutdown first does the equivalent of KernelClient.shutdown(), i.e. sending a shutdown_request message to the kernel. Then it polls the kernel to see if it's still alive, with a configurable timeout after which it force KILLs the kernel. So the behaviour here shouldn't really change beyond the extra wait time for things to shut down gracefully (which possibly fixes things that had been breaking).
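In rough pseudocode, that sequence looks like this (a simplified sketch of jupyter_client 6.1.x behaviour, not its actual code):

# What KernelManager.shutdown_kernel(now=False) boils down to, heavily simplified:
def shutdown_kernel_sketch(km, now=False):
    if now:
        km._kill_kernel()  # immediate SIGKILL, i.e. the old nbclient behaviour
        return
    km.request_shutdown()  # send the shutdown_request message to the kernel
    # poll for the process to exit, up to the configurable shutdown_wait_time;
    # if it's still alive after that, finish_shutdown falls back to _kill_kernel()
    km.finish_shutdown(waittime=km.shutdown_wait_time)
    km.cleanup()  # release ports / the connection file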