
Attempt to fix FD leak issue #1051 (#1054)

Merged - 7 commits - May 16, 2022

Conversation

@rahul26goyal (Contributor) commented Mar 25, 2022

Description

This PR is raised with respect to the fix proposed in #1051: wait on a kernel shutdown request until the kernel restart is completed.

  • The fix is made only in EG, as it depends on a kernel field that is set only within the EG code: kernel.restarting.
  • There are some TODOs and open items that are open for discussion. (done)
  • The time used to wait for the restart to complete is taken from KERNEL_LAUNCH_TIMEOUT, as this is precisely the time the kernel start itself would wait before giving up on the kernel.

How does this prevent the FD leak mentioned in #1051?

  • Since we wait for the restart to complete and do not allow shutdown to happen in parallel, the chance of the shutdown flow terminating the kernel at the exact moment the restart is waiting on kernel_info is minimized.
  • This also prevents any other accidental leaks that might happen if a shutdown request is processed while a restart is in progress.
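
To make the flow concrete, here is a minimal sketch of the polling approach described above. This is illustrative only, not the exact PR code; it assumes a kernel.restarting flag and a kernel_launch_timeout value on the manager, per the description.

import asyncio
import time

async def wait_for_restart_finish(kernel, timeout: float, poll_interval: float = 1.0):
    # Poll the restarting flag until the restart completes or we time out.
    start_time = time.time()
    while kernel.restarting:
        if (time.time() - start_time) > timeout:
            break  # give up waiting; the caller proceeds with shutdown anyway
        await asyncio.sleep(poll_interval)  # non-blocking wait

async def handle_shutdown(km, kernel_id: str):
    kernel = km.get_kernel(kernel_id)
    if kernel.restarting:
        # Serialize shutdown behind the in-flight restart so the two flows
        # never tear down the same kernel concurrently (the FD-leak source).
        await wait_for_restart_finish(kernel, timeout=km.kernel_launch_timeout)
    await km.shutdown_kernel(kernel_id)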

TESTING

  • I have done some basic sanity testing with Notebook and JupyterLab, and it looks to be working fine.

  • Tested the changes on a local Python kernel; the changes seem to be working as expected:

  • When a restart request is followed by a shutdown, the shutdown request waits until the restart is completed and then shuts down the kernel cleanly.

  • When a restart request is followed by a shutdown and the shutdown times out waiting, it proceeds to a forceful shutdown. This is the same behaviour as before the current changes.

  • When multiple restart requests are received for the same kernel, the first request performs the actual restart while all the other, duplicate requests wait for kernel.restarting to become False and return (see the sketch below).

Thanks!
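
For the duplicate-restart case, a rough method-level sketch of the guard (mirroring the pattern visible in the review diff later in this thread; wait_for_restart_finish is the illustrative helper sketched above):

async def restart_kernel(self, kernel_id):
    kernel = self.get_kernel(kernel_id)
    if kernel.restarting:
        # Duplicate request: wait for the in-flight restart, then return.
        await wait_for_restart_finish(kernel, timeout=self.kernel_launch_timeout)
        return
    try:
        kernel.restarting = True
        await super().restart_kernel(kernel_id)
    finally:
        kernel.restarting = False  # always clear the flag, even on failure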

Test scenario: the kernel is restarted via the "auto restarter", and then a kernel restart request is sent.

# the kernel was already running and we forcefully killed the kernel process.
# this triggers the auto-restarter
 
AsyncIOLoopKernelRestarter: restarting kernel (1/5), keep random ports
kernel 0fb50ca9-b8ee-4030-9207-958ff3bab86a restarted
 
restarting kernel with value for now: True

Instantiating kernel 'Python 3 (ipykernel)' with process proxy: enterprise_gateway.services.processproxies.processproxy.LocalProcessProxy
Starting kernel (async): ['/Users/rhgoyal/miniconda3/envs/enterprise-gateway-dev/bin/python', '-m', 'ipykernel_launcher', '-f', '/Users/rhgoyal/Library/Jupyter/runtime/kernel-0fb50ca9-b8ee-4030-9207-958ff3bab86a.json']
Launching kernel: 'Python 3 (ipykernel)' with command: ['/Users/rhgoyal/miniconda3/envs/enterprise-gateway-dev/bin/python', '-m', 'ipykernel_launcher', '-f', '/Users/rhgoyal/Library/Jupyter/runtime/kernel-0fb50ca9-b8ee-4030-9207-958ff3bab86a.json']
BaseProcessProxy.launch_process() env: {'PATH': '/usr/local/opt/[email protected]/bin:/Users/rhgoyal/miniconda3/envs/enterprise-gateway-dev/bin:/Users/rhgoyal/miniconda3/condabin:/Users/rhgoyal/miniconda3/bin:/Users/rhgoyal/learning/installs/apache-maven-3.6.3/bin:/Users/rhgoyal/.toolbox/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin', 'KERNEL_EXTRA_SPARK_OPTS': ' --conf spark.kubernetes.executor.label.component=worker ', 'KERNEL_LAUNCH_TIMEOUT': '150', 'KERNEL_WORKING_DIR': '/Users/rhgoyal/learning/k8s/jupyter', 'KERNEL_USERNAME': 'rhgoyal', 'KERNEL_GATEWAY': '1', 'KERNEL_ID': '0fb50ca9-b8ee-4030-9207-958ff3bab86a', 'KERNEL_LANGUAGE': 'python', 'EG_IMPERSONATION_ENABLED': 'False'}
Local kernel launched on '192.168.0.106', pid: 65658, pgid: 65658, KernelID: 0fb50ca9-b8ee-4030-9207-958ff3bab86a, cmd: '['/Users/rhgoyal/miniconda3/envs/enterprise-gateway-dev/bin/python', '-m', 'ipykernel_launcher', '-f', '/Users/rhgoyal/Library/Jupyter/runtime/kernel-0fb50ca9-b8ee-4030-9207-958ff3bab86a.json']'
Connecting to: tcp://127.0.0.1:51321
Refreshing kernel session for id: 0fb50ca9-b8ee-4030-9207-958ff3bab86a
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: status (busy)
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: status (idle)
AsyncIOLoopKernelRestarter: restart apparently succeeded



Websocket closed 0fb50ca9-b8ee-4030-9207-958ff3bab86a:d78a447e-56649b2aad7998e7efa537be
Starting buffering for 0fb50ca9-b8ee-4030-9207-958ff3bab86a:d78a447e-56649b2aad7998e7efa537be
Clearing buffer for 0fb50ca9-b8ee-4030-9207-958ff3bab86a
Initializing websocket connection /api/kernels/0fb50ca9-b8ee-4030-9207-958ff3bab86a/channels
No session ID specified
Opening websocket /api/kernels/0fb50ca9-b8ee-4030-9207-958ff3bab86a/channels
Getting buffer for 0fb50ca9-b8ee-4030-9207-958ff3bab86a
Clearing buffer for 0fb50ca9-b8ee-4030-9207-958ff3bab86a
Connecting to: tcp://127.0.0.1:51318
Connecting to: tcp://127.0.0.1:51317
Connecting to: tcp://127.0.0.1:51321
Connecting to: tcp://127.0.0.1:51319
Connecting to: tcp://127.0.0.1:51317
Connecting to: tcp://127.0.0.1:51321
Nudge: attempt 1 on kernel 0fb50ca9-b8ee-4030-9207-958ff3bab86a
Nudge: IOPub received: 0fb50ca9-b8ee-4030-9207-958ff3bab86a
Nudge: resolving iopub future: 0fb50ca9-b8ee-4030-9207-958ff3bab86a
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: status (busy)
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: status (idle)
Nudge: shell info reply received: 0fb50ca9-b8ee-4030-9207-958ff3bab86a
Nudge: resolving shell future: 0fb50ca9-b8ee-4030-9207-958ff3bab86a
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: status (busy)
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: status (idle)
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: status (busy)
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: status (idle)
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: status (busy)
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: comm_close
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: status (idle)
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: status (busy)
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: status (idle)




Websocket closed 0fb50ca9-b8ee-4030-9207-958ff3bab86a:93c6b3dd-fe2911af8651bfe9c58bd8f9
Starting buffering for 0fb50ca9-b8ee-4030-9207-958ff3bab86a:93c6b3dd-fe2911af8651bfe9c58bd8f9
Clearing buffer for 0fb50ca9-b8ee-4030-9207-958ff3bab86a

# received restart from User / UI
Current value of the kernel.restarting: False
Going ahead to process kernel restart request.
restarting kernel with value for now: False
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: status (busy)
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: shutdown_reply
Buffering msg on 0fb50ca9-b8ee-4030-9207-958ff3bab86a:iopub
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: status (idle)
Buffering msg on 0fb50ca9-b8ee-4030-9207-958ff3bab86a:iopub
Buffering msg on 0fb50ca9-b8ee-4030-9207-958ff3bab86a:iopub
Instantiating kernel 'Python 3 (ipykernel)' with process proxy: enterprise_gateway.services.processproxies.processproxy.LocalProcessProxy
Starting kernel (async): ['/Users/rhgoyal/miniconda3/envs/enterprise-gateway-dev/bin/python', '-m', 'ipykernel_launcher', '-f', '/Users/rhgoyal/Library/Jupyter/runtime/kernel-0fb50ca9-b8ee-4030-9207-958ff3bab86a.json']
Launching kernel: 'Python 3 (ipykernel)' with command: ['/Users/rhgoyal/miniconda3/envs/enterprise-gateway-dev/bin/python', '-m', 'ipykernel_launcher', '-f', '/Users/rhgoyal/Library/Jupyter/runtime/kernel-0fb50ca9-b8ee-4030-9207-958ff3bab86a.json']
BaseProcessProxy.launch_process() env: {'PATH': '/usr/local/opt/[email protected]/bin:/Users/rhgoyal/miniconda3/envs/enterprise-gateway-dev/bin:/Users/rhgoyal/miniconda3/condabin:/Users/rhgoyal/miniconda3/bin:/Users/rhgoyal/learning/installs/apache-maven-3.6.3/bin:/Users/rhgoyal/.toolbox/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin', 'KERNEL_EXTRA_SPARK_OPTS': ' --conf spark.kubernetes.executor.label.component=worker ', 'KERNEL_LAUNCH_TIMEOUT': '150', 'KERNEL_WORKING_DIR': '/Users/rhgoyal/learning/k8s/jupyter', 'KERNEL_USERNAME': 'rhgoyal', 'KERNEL_GATEWAY': '1', 'KERNEL_ID': '0fb50ca9-b8ee-4030-9207-958ff3bab86a', 'KERNEL_LANGUAGE': 'python', 'EG_IMPERSONATION_ENABLED': 'False'}
Local kernel launched on '192.168.0.106', pid: 65678, pgid: 65678, KernelID: 0fb50ca9-b8ee-4030-9207-958ff3bab86a, cmd: '['/Users/rhgoyal/miniconda3/envs/enterprise-gateway-dev/bin/python', '-m', 'ipykernel_launcher', '-f', '/Users/rhgoyal/Library/Jupyter/runtime/kernel-0fb50ca9-b8ee-4030-9207-958ff3bab86a.json']'
Connecting to: tcp://127.0.0.1:51321
Refreshing kernel session for id: 0fb50ca9-b8ee-4030-9207-958ff3bab86a
Kernel restarted: 0fb50ca9-b8ee-4030-9207-958ff3bab86a
Connecting to: tcp://127.0.0.1:51317
Resetting kernel.restarting flag to False.
Initializing websocket connection /api/kernels/0fb50ca9-b8ee-4030-9207-958ff3bab86a/channels
No session ID specified
Opening websocket /api/kernels/0fb50ca9-b8ee-4030-9207-958ff3bab86a/channels
Getting buffer for 0fb50ca9-b8ee-4030-9207-958ff3bab86a
Clearing buffer for 0fb50ca9-b8ee-4030-9207-958ff3bab86a
Discarding 3 buffered messages for 0fb50ca9-b8ee-4030-9207-958ff3bab86a:93c6b3dd-fe2911af8651bfe9c58bd8f9
Connecting to: tcp://127.0.0.1:51318
Connecting to: tcp://127.0.0.1:51317
Connecting to: tcp://127.0.0.1:51321
Connecting to: tcp://127.0.0.1:51319
Connecting to: tcp://127.0.0.1:51317
Connecting to: tcp://127.0.0.1:51321
Nudge: attempt 1 on kernel 0fb50ca9-b8ee-4030-9207-958ff3bab86a
Nudge: attempt 2 on kernel 0fb50ca9-b8ee-4030-9207-958ff3bab86a
Nudge: IOPub received: 0fb50ca9-b8ee-4030-9207-958ff3bab86a
Nudge: resolving iopub future: 0fb50ca9-b8ee-4030-9207-958ff3bab86a
Nudge: shell info reply received: 0fb50ca9-b8ee-4030-9207-958ff3bab86a
Nudge: resolving shell future: 0fb50ca9-b8ee-4030-9207-958ff3bab86a
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: status (busy)
activity on 0fb50ca9-b8ee-4030-9207-958ff3bab86a: status (idle)
Kernel info reply received: 0fb50ca9-b8ee-4030-9207-958ff3bab86a

@kevin-bates (Member) left a comment

Thanks for the PR @rahul26goyal. As I mentioned on the issue, it would be great if you could spend some time with EG 2.6 to reproduce this. This PR is essentially a poll loop and strikes me as analogous to the "pending kernels" work that jupyter_client >= 7 contains (and which EG can't leverage). That said, if we find it helps your situation, it seems fairly harmless to include.

@@ -133,6 +135,33 @@ def get(self, kernel_id):
        model = km.kernel_model(kernel_id)
        self.finish(json.dumps(model, default=date_default))

    @web.authenticated
    async def delete(self, kernel_id):
        self.log.info(f"Received Shutdown for Kernel : {kernel_id}")
Member

Please convert to debug.

Contributor Author

sure

Comment on lines 148 to 149

            self.log.info(f"Going to Poll every {poll_time} seconds for next {timeout} "
                          f"seconds for Kernel to come out of restart.")
Member

Let's adjust the text to sound more definitive. How about...

Suggested change

-            self.log.info(f"Going to Poll every {poll_time} seconds for next {timeout} "
-                          f"seconds for Kernel to come out of restart.")
+            self.log.info(f"Kernel is restarting when shutdown request received. Polling every {poll_time} seconds for next {timeout} "
+                          f"seconds for kernel '{kernel_id}' to complete its restart, then will proceed with its shutdown.")

Contributor Author

sounds good 👍

            while kernel.restarting:
                now = int(time.time())
                if (now - start_time) > timeout:
                    self.log.info("Existing out of the shutdown wait loop to terminate the kernel anyways.")
Member

How about

Suggested change

-                    self.log.info("Existing out of the shutdown wait loop to terminate the kernel anyways.")
+                    self.log.info(f"Restart timeout has been exceeded for kernel '{kernel_id}'. Proceeding with shutdown.")

                    break
                self.log.info(f"going to sleep for {poll_time}")  # TODO remove this.
                await asyncio.sleep(poll_time)
                time.sleep(5)
Member

This should be removed as it blocks the server (and is redundant).

Contributor Author

My bad, it was test code that slipped into this PR.

        if kernel.restarting:
            start_time = int(time.time())  # epoc time in seconds
            timeout = km.kernel_info_timeout  # this could be set to kernel launch timeout to be in sync!
            poll_time = 5  # we can make this configurable
Member

I agree. I think an EG_-prefixed env is probably sufficient for this - rather than a full-blown configurable.
I'm wondering if we should use a smaller value (like 1.0 second) so we can detect the restart's completion sooner.
(If configured via an env, let's make sure it handles floats)
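
For example, a float-tolerant env override could be as simple as the following (the env name matches the one the PR later introduces; the 1.0-second fallback is illustrative):

import os

# Accepts fractional values such as "0.5"; falls back to 1.0 second when unset.
poll_interval: float = float(os.getenv("EG_RESTART_FINISH_POLL_INTERVAL", "1.0"))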

Contributor Author

Agree.

  1. We can have a smaller default value for poll_time to detect the restart sooner, while providing an override option.

        kernel = km.get_kernel(kernel_id)
        if kernel.restarting:
            start_time = int(time.time())  # epoc time in seconds
            timeout = km.kernel_info_timeout  # this could be set to kernel launch timeout to be in sync!
Member

I'm not sure we have access to the value, but the kernel-launch-timeout would probably be more appropriate since restart is a superset of launch.

Contributor Author

I have incorporated this by moving kernel_launch_timeout to a kernel property, which seems like the right place for it?

@rahul26goyal (Contributor Author)

Thanks a lot for the review comments @kevin-bates. I will go over the comments today and address them.
On the testing side:

  1. While testing the earlier fix, we found an issue with duplicate "Restart" requests, which could cause the same race condition. I added a code change to handle that as well.

Thanks!

@kevin-bates (Member) commented Mar 29, 2022

Hi @rahul26goyal. Is there a reason you chose to add the duplicate restart logic in the RemoteMappingKernelManager rather than in the kernel action handler (similar to what you did for shutdown)? There's a definite pattern taking shape here and I hope we can move this kind of code into a common function of sorts.

@rahul26goyal (Contributor Author)

> Hi @rahul26goyal. Is there a reason you chose to add the duplicate restart logic in the RemoteMappingKernelManager rather than in the kernel action handler (similar to what you did for shutdown)? There's a definite pattern taking shape here and I hope we can move this kind of code into a common function of sorts.

I agree that it's duplicated code and a definite pattern being followed. There are two reasons I wrote it this way:

  1. The remove_kernel method in RemoteMappingKernelManager (or any other manager) is not an async function, so I guess it would not allow any asyncio work there? Thus, I implemented the polling in the handler, which is an async method.

  2. The current set of kernel handlers in EG did not extend KernelActionHandler, which handles the restart request. So I ended up writing it in RemoteMappingKernelManager::restart_kernel, which looked like a good place.

If we can find a solution for point 1, I could reuse the pattern and implement the polling logic in RemoteMappingKernelManager, which both shutdown and restart can use.

If point 1 cannot be solved, then I can extend KernelActionHandler in EG and write the polling logic there.

Please let me know how I can proceed further.

@kevin-bates (Member)

I've been giving this some thought over the last couple of days. Basically, this logic should be colocated, and the question is where that should occur.

I view this PR as another implementation of pending kernels, but one that doesn't use a Future to determine that the pending portion of things has completed. Since EG cannot leverage that functionality (at this time), we're faced with rolling our own temporarily until we can leverage pending kernels. (This will require EG's transition to kernel provisioners.)

So, if we were to roll our own temporary solution, it seems clear that we'd want to do this in locations similar to what is done in jupyter_client v7 (albeit in the KernelManager subclasses we implement) and, preferably, in a nearly identical manner if possible. This last comment implies the use of AsyncKernelManager.

The EG KernelManager seems to be the right colocation IMO, because it's the EG KernelManager overrides that will essentially dissolve once kernel provisioner support is achieved. Yes, we will likely still have subclasses to accomplish "enterprise" kinds of functionality - like HA/DR, load-balancing, etc.

This all said, I think it would be good to get together and hash this out further. Let's chat via gitter and come up with a time we can meet. Thank you.
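
For contrast, here is a loose, simplified illustration of the Future-based "pending kernels" idea referenced above. This is not jupyter_client's actual API; the class and names are assumptions purely to show the shape: waiters await a Future instead of polling a flag.

import asyncio
from typing import Optional

class PendingKernelSketch:
    """Toy model: `ready` resolves when the pending start/restart completes."""

    def __init__(self) -> None:
        self.ready: Optional[asyncio.Future] = None

    async def restart(self) -> None:
        self.ready = asyncio.get_running_loop().create_future()
        await asyncio.sleep(0.1)      # stand-in for the real restart work
        self.ready.set_result(None)   # wake every waiter at once

    async def shutdown(self) -> None:
        if self.ready is not None and not self.ready.done():
            await self.ready          # no poll loop required
        # ... proceed with the actual shutdown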


    async def wait_and_poll_for_restart_to_complete(self, kernel_id, action="shutdown"):
        kernel = self.get_kernel(kernel_id)
        start_time = int(time.time())  # epoc time in seconds
Member

We should tolerate milliseconds since we'll want to allow for a sub-second interval.

Suggested change

-        start_time = int(time.time())  # epoc time in seconds
+        start_time: float = time.time()  # epoc time

        except KeyError as ke:  # this is hit for multiple shutdown request.
            self.log.exception(f"Exception while shutting Kernel: {kernel_id}: {ke}")

    async def wait_and_poll_for_restart_to_complete(self, kernel_id, action="shutdown"):
Member

This seems a little verbose, perhaps something more like:

Suggested change

-    async def wait_and_poll_for_restart_to_complete(self, kernel_id, action="shutdown"):
+    async def wait_for_restart(self, kernel_id: str, action: str = "shutdown"):

@@ -186,6 +190,50 @@ async def start_kernel(self, *args, **kwargs):
            self.parent.kernel_session_manager.create_session(kernel_id, **kwargs)
        return kernel_id

    async def restart_kernel(self, kernel_id):
        kernel = self.get_kernel(kernel_id)
        self.log.debug(f"Current value of the Kernel Restarting: {kernel.restarting}")
Member

Could we please lowercase Kernel and Restarting in their various places? Might as well reference the variable names as well.

Suggested change

-        self.log.debug(f"Current value of the Kernel Restarting: {kernel.restarting}")
+        self.log.debug(f"Current value of the 'kernel.restarting': {kernel.restarting}")

@rahul26goyal (Contributor Author)

hi @kevin-bates:
we tested the race condition on v2.6.0 without this fix, and we still see the FD leak happening with Kubernetes kernels.

@kevin-bates (Member) left a comment

Hi Rahul - thanks for doing this and for the tremendous troubleshooting effort this took - this is not easy to diagnose!

I had some relatively minor comments regarding the logging of information. If you find that certain statements I asked to be removed are essential, please feel free to push back and we can discuss. Thanks.

Comment on lines 200 to 203

            self.log.info(
                f"Done with waiting for restart to complete. Current value of kernel.restarting: {kernel.restarting}. "
                f"Skipping kernel restart."
            )
Member

Let's remove this. We can infer this from any logging the wait_for_restart_finish() does.

Contributor Author

Makes sense, I'll remove these.

f"Skipping kernel restart."
)
return
self.log.info("Going ahead to process kernel restart request.")
Member

Please remove.

Contributor Author

sure

            kernel.restarting = True  # Moved in out of RemoteKernelManager
            await super().restart_kernel(kernel_id)
        finally:
            self.log.debug("Resetting kernel.restarting flag to False.")
Member

I don't think this is very helpful, please remove.

Contributor Author

sure


    async def shutdown_kernel(self, kernel_id, now=False, restart=False):
        kernel = self.get_kernel(kernel_id)
        self.log.debug(f"Current value of the Kernel Restarting: {kernel.restarting}")
Member

Can be inferred. Please remove.

Contributor Author

yes

        try:
            await super().shutdown_kernel(kernel_id, now, restart)
        except KeyError as ke:  # this is hit for multiple shutdown request.
            self.log.exception(f"Exception while shutting Kernel: {kernel_id}: {ke}")
Member

Suggested change

-            self.log.exception(f"Exception while shutting Kernel: {kernel_id}: {ke}")
+            self.log.exception(f"Exception while shutting down kernel: '{kernel_id}': {ke}")

Shouldn't this be re-raised? Or perhaps only if 'restarting' == False?

Contributor Author

This usually happens when duplicate kernel shutdown requests are sent while the kernel is still restarting.
I will test this by raising the exception and get back with the behaviour.

Contributor Author

Hi @kevin-bates: I tested various scenarios for this:

  1. The default behaviour of JEG when multiple shutdown requests are sent for the same kernel ID: apart from the first request, which returns 204 after shutting down the kernel, the other requests hit an exception while popping the kernel_id from the MultiKernelManager._kernels dict. This leads to KeyError: 'b0b149e8-8a22-48db-b8b6-ed90e441ac41', and the final response sent to the client is HTTP 500: 500 DELETE /api/kernels/b0b149e8-8a22-48db-b8b6-ed90e441ac41.

  2. With the current code change in place, we handle the exception gracefully and return HTTP 204. This behaviour differs from point 1 above, but it is similar to the scenario where a shutdown request is sent for a kernel_id that does not exist. https://github.com/jupyter-server/jupyter_server/blob/main/jupyter_server/services/kernels/kernelmanager.py#L654

Let me know how to proceed.
I feel that, in order to keep the behaviour the same, we can skip the graceful handling and let it throw 500 to the user as before.

Member

Could we please update the text of the log statement to the suggested value? It doesn't adhere to the conventions of the others (lower-case kernel, quoted kernel_id, and adds 'down' to complete the action).

Member

Sorry, I just realized I didn't answer your question above.

Hmm. The timing issue you describe goes beyond the current checks made in the _check_kernel_id() override, in which we'll return a 404 if the kernel_id is unknown to the MultiKernelManager - correct? (I had fixed one such race condition a while back related to the _kernel_connections list.)

So I guess a similar issue still exists wrt _kernels, and the try/except KeyError block is a catch for that. (You're using the latest server code with AsyncKernelManagement - correct?)

I like the fact that we'd no longer raise a 500 in this case, and most cases of this nature will result in a 404 due to the _check_kernel_id() override. I think it could be misleading to return 204 when such requests did not delete the resource (kernel) and believe the best status, in this case, would be 404 - since that's the truth and essentially exhibits similar behavior to when the kernel-id is not managed (which is also true).

Contributor Author

Yes @kevin-bates, your observations are correct, and I am testing with AsyncKernelManagement.
The next step here is to raise a 404, similar to _check_kernel_id:

    raise web.HTTPError(404, "Kernel does not exist: %s" % kernel_id)

Also, should I raise a CR against AsyncMappingKernelManager::shutdown_kernel in the jupyter_server repo to handle the KeyError exception there?
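
In context, the agreed handling might look like the following sketch (the surrounding try/except is from the diff above; the warning text is illustrative):

from tornado import web

try:
    await super().shutdown_kernel(kernel_id, now, restart)
except KeyError:
    # Duplicate shutdown: the kernel was already removed from _kernels,
    # so report that the resource no longer exists rather than 204 or 500.
    self.log.warning(f"Kernel not found while shutting down: '{kernel_id}'")
    raise web.HTTPError(404, "Kernel does not exist: %s" % kernel_id)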

Member

Under normal circumstances, I'd say yeah - if you can reproduce it in that environment. But that environment (outside of EG) will be using the pending kernel support, in which case I'd be surprised this can be reproduced. So it comes down to "is it worth a patch to jupyter_client 6.x?", and I'd say let the status quo flow and live with the fact that 500 will be returned in this relatively rare scenario. If we find issues here when converting to provisioners, we can tackle this then, but this particular area of the stack is still evolving (e.g., pending kernels will likely be subsumed by the state machine work).

Comment on lines 228 to 232

            self.log.info(
                f"kernel is restarting when {action} request received. Polling every {poll_time} "
                f"seconds for next {timeout} seconds for kernel '{kernel_id}'"
                f" to complete its restart."
            )
Member

Suggested change

-            self.log.info(
-                f"kernel is restarting when {action} request received. Polling every {poll_time} "
-                f"seconds for next {timeout} seconds for kernel '{kernel_id}'"
-                f" to complete its restart."
-            )
+            self.log.info(
+                f"Kernel '{kernel_id}' was restarting when {action} request received. Polling every {poll_time} "
+                f"seconds for next {timeout} seconds for kernel to complete its restart."
+            )

Contributor Author

Sounds good, I'll update.

Comment on lines 236 to 238

            self.log.info(
                f"Wait_Timeout: Existing out of the restart wait loop to {action} kernel."
            )
Member

Let's change this to debug...

Suggested change

-            self.log.info(
-                f"Wait_Timeout: Existing out of the restart wait loop to {action} kernel."
-            )
+            self.log.debug(
+                f"Timeout: Exiting restart wait loop in order to {action} kernel '{kernel_id}'."
+            )

Contributor Author

I can change this to debug, but I think this is an important log line that needs to be visible by default. Please give this another thought and let me know.

Member

Ok. Leaving at INFO seems ok. I'm not sure how useful it is other than to perhaps get an idea that a race condition occurred during restart/restart or restart/shutdown. That said, if there is noise, this might be something for operators to look into.

Contributor Author

Yes @kevin-bates - this will be helpful for debugging any new issues that might arise due to this new code flow.

Comment on lines 241 to 243

            self.log.info(
                f"Returning with current value of the kernel.restarting: {kernel.restarting}."
            )
Member

Suggested change

-            self.log.info(
-                f"Returning with current value of the kernel.restarting: {kernel.restarting}."
-            )
+            self.log.debug(
+                f"Returning from restart-wait with kernel.restarting value: {kernel.restarting} for kernel '{kernel_id}'."
+            )

Contributor Author

This can also be inferred, so I'm removing this log line completely.

Comment on lines 471 to 472

        if env.get("KERNEL_LAUNCH_TIMEOUT", None):
            self.kernel_launch_timeout = float(env.get("KERNEL_LAUNCH_TIMEOUT"))
Member

We can scrap the if statement...

Suggested change

-        if env.get("KERNEL_LAUNCH_TIMEOUT", None):
-            self.kernel_launch_timeout = float(env.get("KERNEL_LAUNCH_TIMEOUT"))
+        self.kernel_launch_timeout = float(env.get("KERNEL_LAUNCH_TIMEOUT", default_kernel_launch_timeout))

Contributor Author

At L398, I have already initialised this with the default value, so I skipped setting it again.
Let me know if we still need to add this here?

Member

I understand. It just seems cleaner and more maintainable to not have the extra if statement IMO.

Contributor Author

Okay, I'll make the change.

Member

Gentle ping regarding the suggested change.

Comment on lines 565 to 568

        # this is added for only auto-restarter as it directly call this method.
        self.log.info(f"restarting kernel with value for now: {now}")
        if now:
            self.restarting = True
Member

I don't think this logging is that helpful. When possible, I'd rather have the comment specific to the action. Where it is makes it tough for someone reading the code to know what "this is added" is referring to. They might think it's the larger block of code.

Suggested change

-        # this is added for only auto-restarter as it directly call this method.
-        self.log.info(f"restarting kernel with value for now: {now}")
-        if now:
-            self.restarting = True
+        if now:  # if auto-restarting (when now is True), indicate we're restarting.
+            self.restarting = True

Contributor Author

Yes, agreed Kevin.

@kevin-bates (Member) left a comment

Getting closer. One typo and a couple of suggestions.

@@ -18,6 +19,9 @@
from ..processproxies.processproxy import LocalProcessProxy, RemoteProcessProxy
from ..sessions.kernelsessionmanager import KernelSessionManager

default_kernel_launch_timeout = float(os.getenv("EG_KERNEL_LAUNCH_TIMEOUT", "30"))
kernel_restart_finish_poll_internal = float(os.getenv("EG_RESTART_FINISH_POLL_INTERVAL", 1.0))
Member

typo

Suggested change

-kernel_restart_finish_poll_internal = float(os.getenv("EG_RESTART_FINISH_POLL_INTERVAL", 1.0))
+kernel_restart_finish_poll_interval = float(os.getenv("EG_RESTART_FINISH_POLL_INTERVAL", 1.0))


@kevin-bates (Member) left a comment

Looks good @rahul26goyal - thank you!
Could you please remove "(WIP)" from the title once you feel this is ready (which I'm assuming is the case)?

@rahul26goyal changed the title from "Attempt to fix FD leak issue #1051 (WIP)" to "Attempt to fix FD leak issue #1051" on May 16, 2022
@rahul26goyal (Contributor Author)

Hi @kevin-bates, the changes are good to merge now.

@kevin-bates (Member)

Great - thanks for the response.
