Attempt to fix FD leak issue #1051 #1054
Conversation
Thanks for the PR @rahul26goyal. As I mentioned on the issue, it would be great if you could spend some time with EG 2.6 to reproduce this. This PR is essentially a poll loop and strikes me as analogous to the "pending kernels" work that jupyter_client >= 7 contains (and that EG can't leverage). That said, if we find it helps your situation, it seems fairly harmless to include.
@@ -133,6 +135,33 @@ def get(self, kernel_id):
        model = km.kernel_model(kernel_id)
        self.finish(json.dumps(model, default=date_default))

    @web.authenticated
    async def delete(self, kernel_id):
        self.log.info(f"Received Shutdown for Kernel : {kernel_id}")
Please convert to debug.
sure
self.log.info(f"Going to Poll every {poll_time} seconds for next {timeout} " | ||
f"seconds for Kernel to come out of restart.") |
Let's adjust the text to sound more definitive. How about...
self.log.info(f"Going to Poll every {poll_time} seconds for next {timeout} " | |
f"seconds for Kernel to come out of restart.") | |
self.log.info(f"Kernel is restarting when shutdown request received. Polling every {poll_time} seconds for next {timeout} " | |
f"seconds for kernel '{kernel_id}' to complete its restart, then will proceed with its shutdown.") |
sounds good 👍
        while kernel.restarting:
            now = int(time.time())
            if (now - start_time) > timeout:
                self.log.info("Existing out of the shutdown wait loop to terminate the kernel anyways.")
How about
self.log.info("Existing out of the shutdown wait loop to terminate the kernel anyways.") | |
self.log.info(f"Restart timeout has been exceeded for kernel '{kernel_id}'. Proceeding with shutdown.") |
                break
            self.log.info(f"going to sleep for {poll_time}")  # TODO remove this.
            await asyncio.sleep(poll_time)
            time.sleep(5)
This should be removed as it blocks the server (and is redundant).
My bad, it was test code that slipped into this PR.
        if kernel.restarting:
            start_time = int(time.time())  # epoc time in seconds
            timeout = km.kernel_info_timeout  # this could be set to kernel launch timeout to be in sync!
            poll_time = 5  # we can make this configurable
I agree. I think an EG_-prefixed env is probably sufficient for this - rather than a full-blown configurable.
I'm wondering if we should use a smaller value (like 1.0 second) so we can detect the restart's completion sooner.
(If configured via an env, let's make sure it handles floats)
Agree. We can have a smaller default value for poll_time to detect the restart sooner while providing an override option.
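For illustration, a minimal sketch of the float-capable env override being discussed here, using the EG_RESTART_FINISH_POLL_INTERVAL name that appears later in this PR:

    import os

    # Poll interval (seconds) used while waiting for an in-flight restart to finish.
    # Parsed with float() so operators can configure sub-second values such as "0.5".
    kernel_restart_finish_poll_interval = float(
        os.getenv("EG_RESTART_FINISH_POLL_INTERVAL", "1.0")
    )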
        kernel = km.get_kernel(kernel_id)
        if kernel.restarting:
            start_time = int(time.time())  # epoc time in seconds
            timeout = km.kernel_info_timeout  # this could be set to kernel launch timeout to be in sync!
I'm not sure we have access to the value, but the kernel-launch-timeout would probably be more appropriate since restart is a superset of launch.
I have tried incorporating this fix by moving the kernel_launch_timeout to a kernel property, which seems like the right place for it?
Thanks a lot for the review comments @kevin-bates. I will go over the comments today and address those.
Force-pushed 4f15f4f to 13c1b7e
Hi @rahul26goyal. Is there a reason you chose to add the duplicate restart logic in the …?
I agree that it's duplicate code and a definite pattern being followed. If we can find a solution for Point 1, I could reuse the pattern and implement the polling logic there. If Point 1 cannot be solved, then I can extend the existing approach. Please let me know how I can proceed further.
I've been giving this some thought over the last couple of days. Basically, this logic should be colocated, and the question is where that should occur. I view this PR as another implementation of pending kernels, but one that doesn't use a Future to determine that the pending portion of things has completed. Since EG cannot leverage that functionality (at this time), we're faced with rolling our own temporarily until we can leverage pending kernels. (This will require EG's transition to kernel provisioners.)

So, if we were to roll our own temporary solution, it seems clear that we'd want to do this in similar locations to what is done in jupyter_client v7 (albeit in the KernelManager subclasses we implement) and, preferably, in a nearly identical manner if possible. This last comment implies the use of AsyncKernelManager. The EG KernelManager seems to be the right colocation IMO because it's the EG KernelManager overrides that will essentially dissolve once kernel provisioner support is achieved. Yes, we will likely still have subclasses to accomplish "enterprise" kinds of functionality - like HA/DR, load-balancing, etc.

This all said, I think it would be good to get together and hash this out further. Let's chat via gitter and come up with a time we can meet. Thank you.
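For illustration only, a rough sketch (hypothetical names) contrasting the two approaches compared above: the poll loop in this PR versus a Future-based "pending kernel" signal as in jupyter_client >= 7:

    import asyncio
    import time

    # Approach in this PR: poll a boolean flag until the restart completes or a timeout elapses.
    async def wait_by_polling(kernel, timeout: float, poll_interval: float = 1.0) -> None:
        start = time.time()
        while kernel.restarting:
            if time.time() - start > timeout:
                break  # give up waiting and proceed with the requested action
            await asyncio.sleep(poll_interval)

    # Pending-kernels style: await a Future the kernel manager resolves once the
    # pending phase (launch or restart) has completed.
    async def wait_by_future(ready: asyncio.Future, timeout: float) -> None:
        await asyncio.wait_for(asyncio.shield(ready), timeout)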
Force-pushed 7575dad to 8281154
    async def wait_and_poll_for_restart_to_complete(self, kernel_id, action="shutdown"):
        kernel = self.get_kernel(kernel_id)
        start_time = int(time.time())  # epoc time in seconds
We should tolerate milliseconds since we'll want to allow for a sub-second interval.
-        start_time = int(time.time())  # epoc time in seconds
+        start_time: float = time.time()  # epoc time
        except KeyError as ke:  # this is hit for multiple shutdown request.
            self.log.exception(f"Exception while shutting Kernel: {kernel_id}: {ke}")

    async def wait_and_poll_for_restart_to_complete(self, kernel_id, action="shutdown"):
This seems a little verbose, perhaps something more like:
-    async def wait_and_poll_for_restart_to_complete(self, kernel_id, action="shutdown"):
+    async def wait_for_restart(self, kernel_id: str, action: str = "shutdown"):
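Pulling the review feedback together, a sketch of what the renamed helper might look like. This is illustrative only: it is written as a method on the EG kernel manager subclass, so self.get_kernel, self.log, and self.kernel_launch_timeout are assumed to be available there, and the poll-interval constant mirrors the module-level env value introduced later in this PR.

    import asyncio
    import os
    import time

    kernel_restart_finish_poll_interval = float(os.getenv("EG_RESTART_FINISH_POLL_INTERVAL", "1.0"))

    async def wait_for_restart(self, kernel_id: str, action: str = "shutdown") -> None:
        """Poll until an in-flight restart of `kernel_id` finishes, or the timeout elapses."""
        kernel = self.get_kernel(kernel_id)
        if not kernel.restarting:
            return

        start_time: float = time.time()  # float, so sub-second poll intervals work
        timeout: float = self.kernel_launch_timeout  # restart is a superset of launch
        poll_time: float = kernel_restart_finish_poll_interval

        self.log.info(
            f"Kernel '{kernel_id}' was restarting when {action} request received. "
            f"Polling every {poll_time} seconds for next {timeout} seconds "
            f"for kernel to complete its restart."
        )
        while kernel.restarting:
            if time.time() - start_time > timeout:
                self.log.info(
                    f"Restart timeout has been exceeded for kernel '{kernel_id}'. Proceeding with {action}."
                )
                break
            await asyncio.sleep(poll_time)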
@@ -186,6 +190,50 @@ async def start_kernel(self, *args, **kwargs):
        self.parent.kernel_session_manager.create_session(kernel_id, **kwargs)
        return kernel_id

    async def restart_kernel(self, kernel_id):
        kernel = self.get_kernel(kernel_id)
        self.log.debug(f"Current value of the Kernel Restarting: {kernel.restarting}")
Could we please lowercase Kernel and Restarting in their various places? Might as well reference the variable names as well.
self.log.debug(f"Current value of the Kernel Restarting: {kernel.restarting}") | |
self.log.debug(f"Current value of the 'kernel.restarting': {kernel.restarting}") |
Force-pushed 8281154 to 3365170
Force-pushed f0b0a01 to 15e82ab
hi @kevin-bates:
Hi Rahul - thanks for doing this and the tremendous troubleshooting efforts this took - this is not easy to diagnose!
I had some relatively minor comments regarding the logging of information. If you have found certain statements I asked to be removed to be essential, please feel free to push back and we can discuss. Thanks.
            self.log.info(
                f"Done with waiting for restart to complete. Current value of kernel.restarting: {kernel.restarting}. "
                f"Skipping kernel restart."
            )
Let's remove this. We can infer this from any logging the wait_for_restart_finish() does.
Makes sense, will remove these.
f"Skipping kernel restart." | ||
) | ||
return | ||
self.log.info("Going ahead to process kernel restart request.") |
Please remove.
sure
            kernel.restarting = True  # Moved in out of RemoteKernelManager
            await super().restart_kernel(kernel_id)
        finally:
            self.log.debug("Resetting kernel.restarting flag to False.")
I don't think this is very helpful, please remove.
sure
    async def shutdown_kernel(self, kernel_id, now=False, restart=False):
        kernel = self.get_kernel(kernel_id)
        self.log.debug(f"Current value of the Kernel Restarting: {kernel.restarting}")
Can be inferred. Please remove.
yes
        try:
            await super().shutdown_kernel(kernel_id, now, restart)
        except KeyError as ke:  # this is hit for multiple shutdown request.
            self.log.exception(f"Exception while shutting Kernel: {kernel_id}: {ke}")
self.log.exception(f"Exception while shutting Kernel: {kernel_id}: {ke}") | |
self.log.exception(f"Exception while shutting down kernel: '{kernel_id}': {ke}") |
Shouldn't this be re-raised? Or perhaps only if 'restarting' == False?
This usually happens when duplicate kernel shutdown requests have been sent while the kernel was still restarting. I will test this by raising the exception and get back with the behaviour.
Hi @kevin-bates: I tested various scenarios for this:

1. The default behaviour of JEG when multiple shutdown requests are sent for the same kernel ID: except for the first request, which returns 204 after shutting down the kernel, the other requests hit an exception while popping the kernel_id from the MultiKernelManager._kernels dict. This leads to KeyError: 'b0b149e8-8a22-48db-b8b6-ed90e441ac41', and the final response returned to the clients is HTTP 500: 500 DELETE /api/kernels/b0b149e8-8a22-48db-b8b6-ed90e441ac41.
2. With the current code change in place, we handle the exception gracefully and return an HTTP 204. This behaviour differs from point 1 above, but it is similar to the scenario where a shutdown request is sent for a kernel_id which does not exist. https://github.com/jupyter-server/jupyter_server/blob/main/jupyter_server/services/kernels/kernelmanager.py#L654

Let me know how to proceed forward with this. I feel that, in order to keep the behaviour the same, we can skip the graceful handling and let it throw a 500 to the user as before.
Could we please update the text of the log statement to the suggested value? It doesn't adhere to the conventions of the others (lower-case kernel, quoted kernel_id, and adds 'down' to complete the action).
Sorry, I just realized I didn't answer your question above.

Hmm. The timing issue you describe goes beyond the current checks made in the _check_kernel_id() override, in which we'll return a 404 if the kernel_id is unknown to the MultiKernelManager - correct? (I had fixed one such race condition a while ago related to the _kernel_connections list.)

So I guess a similar issue still exists with respect to _kernels, and the try/except KeyError block is a catch for that. (You're using the latest server code with AsyncKernelManagement - correct?)

I like the fact that we'd no longer raise a 500 in this case, and most cases of this nature will result in a 404 due to the _check_kernel_id() override. I think it could be misleading to return 204 when such requests did not delete the resource (kernel), and believe the best status, in this case, would be 404 - since that's the truth and essentially exhibits similar behavior to when the kernel-id is not managed (which is also true).
Yes @kevin-bates, your observations are correct, and I am testing with AsyncKernelManagement.

The next step here is to raise a 404, similar to check_kernel_id:
raise web.HTTPError(404, "Kernel does not exist: %s" % kernel_id)

Also, should I raise a CR against AsyncMappingKernelManager::shutdown_kernel to handle the KeyError exception in the jupyter_server repo?
Under normal circumstances, I'd say, yeah, if you can reproduce it in that environment. But that environment (outside of EG) will be using the pending kernel support - in which case I'd be surprised this can be reproduced. So it comes down to, "is it worth a patch to jupyter_client 6.x?" and I'd say, let the status quo flow and live with the fact that 500 will be returned in this relatively rare scenario. If we find issues here when converting to provisioners, we can tackle this then, but this particular area of the stack is still evolving (e.g., pending kernels will likely be subsumed by the state machine work).
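For reference, a sketch (illustrative only) of the KeyError-to-404 translation discussed above. It is written as a method body; in the real code super() refers to the parent kernel-manager class, and self.log is the manager's logger.

    from tornado import web

    # Method sketch for the EG kernel manager's shutdown_kernel override.
    async def shutdown_kernel(self, kernel_id, now=False, restart=False):
        try:
            await super().shutdown_kernel(kernel_id, now, restart)
        except KeyError as ke:
            # Raised when a duplicate shutdown request races the first one and the
            # kernel_id has already been removed from the multi-kernel manager.
            self.log.exception(f"Exception while shutting down kernel: '{kernel_id}': {ke}")
            raise web.HTTPError(404, "Kernel does not exist: %s" % kernel_id) from ke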
            self.log.info(
                f"kernel is restarting when {action} request received. Polling every {poll_time} "
                f"seconds for next {timeout} seconds for kernel '{kernel_id}'"
                f" to complete its restart."
            )
-            self.log.info(
-                f"kernel is restarting when {action} request received. Polling every {poll_time} "
-                f"seconds for next {timeout} seconds for kernel '{kernel_id}'"
-                f" to complete its restart."
-            )
+            self.log.info(
+                f"Kernel '{kernel_id}' was restarting when {action} request received. Polling every {poll_time} "
+                f"seconds for next {timeout} seconds for kernel to complete its restart."
+            )
Sounds good, will update.
            self.log.info(
                f"Wait_Timeout: Existing out of the restart wait loop to {action} kernel."
            )
Let's change this to debug...
-            self.log.info(
-                f"Wait_Timeout: Existing out of the restart wait loop to {action} kernel."
-            )
+            self.log.debug(
+                f"Timeout: Exiting restart wait loop in order to {action} kernel '{kernel_id}'."
+            )
I can change this to debug, but I think this is an important log line that needs to be visible by default. Please give this another thought and let me know.
Ok. Leaving at INFO seems ok. I'm not sure how useful it is other than to perhaps get an idea that a race condition occurred during restart/restart or restart/shutdown. That said, if there is noise, this might be something for operators to look into.
Yes @kevin-bates, this will be helpful for debugging any new issue that might arise due to this new code flow.
            self.log.info(
                f"Returning with current value of the kernel.restarting: {kernel.restarting}."
            )
-            self.log.info(
-                f"Returning with current value of the kernel.restarting: {kernel.restarting}."
-            )
+            self.log.debug(
+                f"Returning from restart-wait with kernel.restarting value: {kernel.restarting} for kernel '{kernel_id}'."
+            )
This can also be inferred, so I'm removing this log line completely.
if env.get("KERNEL_LAUNCH_TIMEOUT", None): | ||
self.kernel_launch_timeout = float(env.get("KERNEL_LAUNCH_TIMEOUT")) |
We can scrap the if statement...
if env.get("KERNEL_LAUNCH_TIMEOUT", None): | |
self.kernel_launch_timeout = float(env.get("KERNEL_LAUNCH_TIMEOUT")) | |
self.kernel_launch_timeout = float(env.get("KERNEL_LAUNCH_TIMEOUT", default_kernel_launch_timeout)) |
At L398, I have already initialised this with the default value, so I skipped setting it again. Let me know if we still need to add this here?
I understand. It just seems cleaner and more maintainable to not have the extra if statement IMO.
Okay, will make the change.
Gentle ping regarding the suggested change.
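A sketch of the suggested simplification, combined with the module-level default defined later in this diff. The helper name resolve_launch_timeout is hypothetical and only illustrates the single env lookup replacing the if guard.

    import os

    # Module-level default, overridable process-wide via EG_KERNEL_LAUNCH_TIMEOUT (per this PR).
    default_kernel_launch_timeout = float(os.getenv("EG_KERNEL_LAUNCH_TIMEOUT", "30"))

    def resolve_launch_timeout(env: dict) -> float:
        # One lookup with a default replaces the `if env.get(...)` guard and
        # still tolerates float values such as "45.5".
        return float(env.get("KERNEL_LAUNCH_TIMEOUT", default_kernel_launch_timeout))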
        # this is added for only auto-restarter as it directly call this method.
        self.log.info(f"restarting kernel with value for now: {now}")
        if now:
            self.restarting = True
I don't think this logging is that helpful. When possible, I'd rather have the comment specific to the action. Where it is makes it tough for someone reading the code to know what "this is added" refers to. They might think it's the larger block of code.
-        # this is added for only auto-restarter as it directly call this method.
-        self.log.info(f"restarting kernel with value for now: {now}")
-        if now:
-            self.restarting = True
+        if now:  # if auto-restarting (when now is True), indicate we're restarting.
+            self.restarting = True
Yes, I agree Kevin.
Getting closer. One typo and a couple of suggestions.
@@ -18,6 +19,9 @@
from ..processproxies.processproxy import LocalProcessProxy, RemoteProcessProxy
from ..sessions.kernelsessionmanager import KernelSessionManager

default_kernel_launch_timeout = float(os.getenv("EG_KERNEL_LAUNCH_TIMEOUT", "30"))
kernel_restart_finish_poll_internal = float(os.getenv("EG_RESTART_FINISH_POLL_INTERVAL", 1.0))
typo
-kernel_restart_finish_poll_internal = float(os.getenv("EG_RESTART_FINISH_POLL_INTERVAL", 1.0))
+kernel_restart_finish_poll_interval = float(os.getenv("EG_RESTART_FINISH_POLL_INTERVAL", 1.0))
        try:
            await super().shutdown_kernel(kernel_id, now, restart)
        except KeyError as ke:  # this is hit for multiple shutdown request.
            self.log.exception(f"Exception while shutting Kernel: {kernel_id}: {ke}")
Could we please update the text of the log statement to the suggested value? It doesn't adhere to the conventions of the others (lower-case kernel, quoted kernel_id, and adds 'down' to complete the action).
if env.get("KERNEL_LAUNCH_TIMEOUT", None): | ||
self.kernel_launch_timeout = float(env.get("KERNEL_LAUNCH_TIMEOUT")) |
Gentle ping regarding the suggested change.
Force-pushed f3e7b2c to 84127bb
for more information, see https://pre-commit.ci
Looks good @rahul26goyal - thank you!
Could you please remove "(WIP)" from the title once you feel this is ready (which I'm assuming is the case)?
hi @kevin-bates
Great - thanks for the response.
Description
This PR is raised with respect to the fix proposed in #1051: wait on the kernel shutdown request until the kernel restart is completed.
How does this prevent the FD leak mentioned in #1051?
… kernel_info … is minimal.

TESTING
I have done some basic sanity testing with Notebook and JupyterLab and it looks to be working fine. Tested the changes on a local Python kernel and the changes seem to be working as expected.
- When a restart request is followed by a shutdown, the shutdown request waits until the restart is completed and then shuts down the kernel cleanly.
- When a restart request is followed by a shutdown and the shutdown times out waiting, it proceeds to a forceful shutdown. This was the same behaviour before the current changes were made.
- When multiple restart requests are received for the same kernel, the first request performs the actual restart while all the other duplicate requests wait for kernel.restarting to become False and then return.

Thanks!
Test scenario: the kernel is restarted via the "auto restarter", and then a kernel restart request is sent.
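A rough sketch of how the restart-then-shutdown race described above could be exercised in a test. This is illustrative only: the kernel_manager fixture is hypothetical, and pytest-asyncio is assumed to be available.

    import asyncio
    import pytest

    @pytest.mark.asyncio  # requires pytest-asyncio
    async def test_shutdown_waits_for_restart(kernel_manager):  # hypothetical fixture
        kernel_id = await kernel_manager.start_kernel(kernel_name="python3")

        # Fire a restart and, without waiting for it, immediately request shutdown.
        restart_task = asyncio.ensure_future(kernel_manager.restart_kernel(kernel_id))
        await kernel_manager.shutdown_kernel(kernel_id)

        # The shutdown should have waited for kernel.restarting to become False,
        # so the kernel is gone afterwards and no file descriptors are left behind.
        assert kernel_id not in kernel_manager.list_kernel_ids()

        if not restart_task.done():
            restart_task.cancel()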