Attempt to fix FD leak issue #1051 #1054
Conversation
Thanks for the PR @rahul26goyal. As I mentioned on the issue, it would be great if you could spend some time with EG 2.6 to reproduce this. This PR is essentially a poll loop and strikes me as analogous to the "pending kernels" work that jupyter_client >= 7 contains (and that EG can't leverage). That said, if we find it helps your situation, it seems fairly harmless to include.
@@ -133,6 +135,33 @@ def get(self, kernel_id):
        model = km.kernel_model(kernel_id)
        self.finish(json.dumps(model, default=date_default))

    @web.authenticated
    async def delete(self, kernel_id):
        self.log.info(f"Received Shutdown for Kernel : {kernel_id}")
Please convert to debug.
sure
self.log.info(f"Going to Poll every {poll_time} seconds for next {timeout} " | ||
f"seconds for Kernel to come out of restart.") |
Let's adjust the text to sound more definitive. How about...
self.log.info(f"Going to Poll every {poll_time} seconds for next {timeout} " | |
f"seconds for Kernel to come out of restart.") | |
self.log.info(f"Kernel is restarting when shutdown request received. Polling every {poll_time} seconds for next {timeout} " | |
f"seconds for kernel '{kernel_id}' to complete its restart, then will proceed with its shutdown.") |
sounds good 👍
        while kernel.restarting:
            now = int(time.time())
            if (now - start_time) > timeout:
                self.log.info("Existing out of the shutdown wait loop to terminate the kernel anyways.")
How about
self.log.info("Existing out of the shutdown wait loop to terminate the kernel anyways.") | |
self.log.info(f"Restart timeout has been exceeded for kernel '{kernel_id}'. Proceeding with shutdown.") |
                break
            self.log.info(f"going to sleep for {poll_time}")  # TODO remove this.
            await asyncio.sleep(poll_time)
            time.sleep(5)
This should be removed as it blocks the server (and is redundant).
My bad, it was test code that slipped into this PR.
        if kernel.restarting:
            start_time = int(time.time())  # epoc time in seconds
            timeout = km.kernel_info_timeout  # this could be set to kernel launch timeout to be in sync!
            poll_time = 5  # we can make this configurable
I agree. I think an EG_-prefixed env is probably sufficient for this - rather than a full-blown configurable.
I'm wondering if we should use a smaller value (like 1.0 second) so we can detect the restart's completion sooner.
(If configured via an env, let's make sure it handles floats)
Agree. We can have a smaller default value for poll_time to detect the restart sooner while providing an override option.
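For illustration, a minimal sketch of the float-capable env override being discussed here, using the EG_RESTART_FINISH_POLL_INTERVAL name that appears later in this PR:

    import os

    # Poll interval (seconds) used while waiting for an in-flight restart to finish.
    # Parsed with float() so operators can configure sub-second values such as "0.5".
    kernel_restart_finish_poll_interval = float(
        os.getenv("EG_RESTART_FINISH_POLL_INTERVAL", "1.0")
    )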
        kernel = km.get_kernel(kernel_id)
        if kernel.restarting:
            start_time = int(time.time())  # epoc time in seconds
            timeout = km.kernel_info_timeout  # this could be set to kernel launch timeout to be in sync!
I'm not sure we have access to the value, but the kernel-launch-timeout would probably be more appropriate since restart is a superset of launch.
I have tried incorporating this fix by moving the kernel_launch_timeout to a kernel property, which seems like the right place for it?
Thanks a lot for the review comments @kevin-bates. I will go over the comments today and address those.
Force-pushed 4f15f4f to 13c1b7e
Hi @rahul26goyal. Is there a reason you chose to add the duplicate restart logic in the …?
I agree that it's duplicate code and a definite pattern being followed. If we can find a solution for Point 1, I could reuse the pattern and implement the polling logic there. If Point 1 cannot be solved, then I can extend the existing approach. Please let me know how I can proceed further.
I've been giving this some thought over the last couple of days. Basically, this logic should be colocated, and the question is where that should occur. I view this PR as another implementation of pending kernels, but one that doesn't use a Future to determine that the pending portion of things has completed. Since EG cannot leverage that functionality (at this time), we're faced with rolling our own temporarily until we can leverage pending kernels. (This will require EG's transition to kernel provisioners.)

So, if we were to roll our own temporary solution, it seems clear that we'd want to do this in similar locations to what is done in jupyter_client v7 (albeit in the KernelManager subclasses we implement) and, preferably, in a nearly identical manner if possible. This last comment implies the use of AsyncKernelManager. The EG KernelManager seems to be the right colocation IMO because it's the EG KernelManager overrides that will essentially dissolve once kernel provisioner support is achieved. Yes, we will likely still have subclasses to accomplish "enterprise" kinds of functionality - like HA/DR, load-balancing, etc.

This all said, I think it would be good to get together and hash this out further. Let's chat via gitter and come up with a time we can meet. Thank you.
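For illustration only, a rough sketch (hypothetical names) contrasting the two approaches compared above: the poll loop in this PR versus a Future-based "pending kernel" signal as in jupyter_client >= 7:

    import asyncio
    import time

    # Approach in this PR: poll a boolean flag until the restart completes or a timeout elapses.
    async def wait_by_polling(kernel, timeout: float, poll_interval: float = 1.0) -> None:
        start = time.time()
        while kernel.restarting:
            if time.time() - start > timeout:
                break  # give up waiting and proceed with the requested action
            await asyncio.sleep(poll_interval)

    # Pending-kernels style: await a Future the kernel manager resolves once the
    # pending phase (launch or restart) has completed.
    async def wait_by_future(ready: asyncio.Future, timeout: float) -> None:
        await asyncio.wait_for(asyncio.shield(ready), timeout)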
Force-pushed 7575dad to 8281154
    async def wait_and_poll_for_restart_to_complete(self, kernel_id, action="shutdown"):
        kernel = self.get_kernel(kernel_id)
        start_time = int(time.time())  # epoc time in seconds
We should tolerate milliseconds since we'll want to allow for a sub-second interval.
-        start_time = int(time.time())  # epoc time in seconds
+        start_time: float = time.time()  # epoc time
        except KeyError as ke:  # this is hit for multiple shutdown request.
            self.log.exception(f"Exception while shutting Kernel: {kernel_id}: {ke}")

    async def wait_and_poll_for_restart_to_complete(self, kernel_id, action="shutdown"):
This seems a little verbose, perhaps something more like:
-    async def wait_and_poll_for_restart_to_complete(self, kernel_id, action="shutdown"):
+    async def wait_for_restart(self, kernel_id: str, action: str = "shutdown"):
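Pulling the review feedback together, a sketch of what the renamed helper might look like. This is illustrative only: it is written as a method on the EG kernel manager subclass, so self.get_kernel, self.log, and self.kernel_launch_timeout are assumed to be available there, and the poll-interval constant mirrors the module-level env value introduced later in this PR.

    import asyncio
    import os
    import time

    kernel_restart_finish_poll_interval = float(os.getenv("EG_RESTART_FINISH_POLL_INTERVAL", "1.0"))

    async def wait_for_restart(self, kernel_id: str, action: str = "shutdown") -> None:
        """Poll until an in-flight restart of `kernel_id` finishes, or the timeout elapses."""
        kernel = self.get_kernel(kernel_id)
        if not kernel.restarting:
            return

        start_time: float = time.time()  # float, so sub-second poll intervals work
        timeout: float = self.kernel_launch_timeout  # restart is a superset of launch
        poll_time: float = kernel_restart_finish_poll_interval

        self.log.info(
            f"Kernel '{kernel_id}' was restarting when {action} request received. "
            f"Polling every {poll_time} seconds for next {timeout} seconds "
            f"for kernel to complete its restart."
        )
        while kernel.restarting:
            if time.time() - start_time > timeout:
                self.log.info(
                    f"Restart timeout has been exceeded for kernel '{kernel_id}'. Proceeding with {action}."
                )
                break
            await asyncio.sleep(poll_time)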
@@ -186,6 +190,50 @@ async def start_kernel(self, *args, **kwargs):
        self.parent.kernel_session_manager.create_session(kernel_id, **kwargs)
        return kernel_id

    async def restart_kernel(self, kernel_id):
        kernel = self.get_kernel(kernel_id)
        self.log.debug(f"Current value of the Kernel Restarting: {kernel.restarting}")
Could we please lowercase Kernel and Restarting in their various places? Might as well reference the variable names as well.
self.log.debug(f"Current value of the Kernel Restarting: {kernel.restarting}") | |
self.log.debug(f"Current value of the 'kernel.restarting': {kernel.restarting}") |
Force-pushed 8281154 to 3365170
Force-pushed f0b0a01 to 15e82ab
hi @kevin-bates:
Hi Rahul - thanks for doing this and the tremendous troubleshooting efforts this took - this is not easy to diagnose!
I had some relatively minor comments regarding the logging of information. If you have found certain statements I asked to be removed to be essential, please feel free to push back and we can discuss. Thanks.
            self.log.info(
                f"Done with waiting for restart to complete. Current value of kernel.restarting: {kernel.restarting}. "
                f"Skipping kernel restart."
            )
Let's remove this. We can infer this from any logging the wait_for_restart_finish() does.
Makes sense, will remove these.
f"Skipping kernel restart." | ||
) | ||
return | ||
self.log.info("Going ahead to process kernel restart request.") |
Please remove.
sure
            kernel.restarting = True  # Moved in out of RemoteKernelManager
            await super().restart_kernel(kernel_id)
        finally:
            self.log.debug("Resetting kernel.restarting flag to False.")
I don't think this is very helpful, please remove.
sure
    async def shutdown_kernel(self, kernel_id, now=False, restart=False):
        kernel = self.get_kernel(kernel_id)
        self.log.debug(f"Current value of the Kernel Restarting: {kernel.restarting}")
Can be inferred. Please remove.
yes
        try:
            await super().shutdown_kernel(kernel_id, now, restart)
        except KeyError as ke:  # this is hit for multiple shutdown request.
            self.log.exception(f"Exception while shutting Kernel: {kernel_id}: {ke}")
self.log.exception(f"Exception while shutting Kernel: {kernel_id}: {ke}") | |
self.log.exception(f"Exception while shutting down kernel: '{kernel_id}': {ke}") |
Shouldn't this be re-raised? Or perhaps only if 'restarting' == False?
This usually happens when duplicate kernel shutdown requests have been sent while the kernel was still restarting. I will test this by raising the exception and get back with the behaviour.
Hi @kevin-bates: I tested various scenarios for this:

1. The default behaviour of JEG when multiple shutdown requests are sent for the same kernel ID: except for the first request, which returns 204 after shutting down the kernel, the other requests hit an exception while popping the kernel_id from the MultiKernelManager._kernels dict. This leads to KeyError: 'b0b149e8-8a22-48db-b8b6-ed90e441ac41', and the final response returned to the clients is HTTP 500: 500 DELETE /api/kernels/b0b149e8-8a22-48db-b8b6-ed90e441ac41.
2. With the current code change in place, we handle the exception gracefully and return an HTTP 204. This behaviour differs from point 1 above, but it is similar to the scenario where a shutdown request is sent for a kernel_id which does not exist. https://github.com/jupyter-server/jupyter_server/blob/main/jupyter_server/services/kernels/kernelmanager.py#L654

Let me know how to proceed forward with this. I feel that, in order to keep the behaviour the same, we can skip the graceful handling and let it throw a 500 to the user as before.
Could we please update the text of the log statement to the suggested value? It doesn't adhere to the conventions of the others (lower-case kernel, quoted kernel_id, and adds 'down' to complete the action).
Sorry, I just realized I didn't answer your question above.

Hmm. The timing issue you describe goes beyond the current checks made in the _check_kernel_id() override, in which we'll return a 404 if the kernel_id is unknown to the MultiKernelManager - correct? (I had fixed one such race condition a while ago related to the _kernel_connections list.)

So I guess a similar issue still exists with respect to _kernels, and the try/except KeyError block is a catch for that. (You're using the latest server code with AsyncKernelManagement - correct?)

I like the fact that we'd no longer raise a 500 in this case, and most cases of this nature will result in a 404 due to the _check_kernel_id() override. I think it could be misleading to return 204 when such requests did not delete the resource (kernel), and believe the best status, in this case, would be 404 - since that's the truth and essentially exhibits similar behavior to when the kernel-id is not managed (which is also true).
Yes @kevin-bates, your observations are correct, and I am testing with AsyncKernelManagement.

The next step here is to raise a 404, similar to check_kernel_id:
raise web.HTTPError(404, "Kernel does not exist: %s" % kernel_id)

Also, should I raise a CR against AsyncMappingKernelManager::shutdown_kernel to handle the KeyError exception in the jupyter_server repo?
Under normal circumstances, I'd say, yeah, if you can reproduce it in that environment. But that environment (outside of EG) will be using the pending kernel support - in which case I'd be surprised this can be reproduced. So it comes down to, "is it worth a patch to jupyter_client 6.x?" and I'd say, let the status quo flow and live with the fact that 500 will be returned in this relatively rare scenario. If we find issues here when converting to provisioners, we can tackle this then, but this particular area of the stack is still evolving (e.g., pending kernels will likely be subsumed by the state machine work).
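For reference, a sketch (illustrative only) of the KeyError-to-404 translation discussed above. It is written as a method body; in the real code super() refers to the parent kernel-manager class, and self.log is the manager's logger.

    from tornado import web

    # Method sketch for the EG kernel manager's shutdown_kernel override.
    async def shutdown_kernel(self, kernel_id, now=False, restart=False):
        try:
            await super().shutdown_kernel(kernel_id, now, restart)
        except KeyError as ke:
            # Raised when a duplicate shutdown request races the first one and the
            # kernel_id has already been removed from the multi-kernel manager.
            self.log.exception(f"Exception while shutting down kernel: '{kernel_id}': {ke}")
            raise web.HTTPError(404, "Kernel does not exist: %s" % kernel_id) from ke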
            self.log.info(
                f"kernel is restarting when {action} request received. Polling every {poll_time} "
                f"seconds for next {timeout} seconds for kernel '{kernel_id}'"
                f" to complete its restart."
            )
-            self.log.info(
-                f"kernel is restarting when {action} request received. Polling every {poll_time} "
-                f"seconds for next {timeout} seconds for kernel '{kernel_id}'"
-                f" to complete its restart."
-            )
+            self.log.info(
+                f"Kernel '{kernel_id}' was restarting when {action} request received. Polling every {poll_time} "
+                f"seconds for next {timeout} seconds for kernel to complete its restart."
+            )
Sounds good, will update.
            self.log.info(
                f"Wait_Timeout: Existing out of the restart wait loop to {action} kernel."
            )
Let's change this to debug...
-            self.log.info(
-                f"Wait_Timeout: Existing out of the restart wait loop to {action} kernel."
-            )
+            self.log.debug(
+                f"Timeout: Exiting restart wait loop in order to {action} kernel '{kernel_id}'."
+            )
I can change this to debug, but I think this is an important log line that needs to be visible by default. Please give this another thought and let me know.
Ok. Leaving at INFO seems ok. I'm not sure how useful it is other than to perhaps get an idea that a race condition occurred during restart/restart or restart/shutdown. That said, if there is noise, this might be something for operators to look into.
Yes @kevin-bates, this will be helpful for debugging any new issue that might arise due to this new code flow.
            self.log.info(
                f"Returning with current value of the kernel.restarting: {kernel.restarting}."
            )
-            self.log.info(
-                f"Returning with current value of the kernel.restarting: {kernel.restarting}."
-            )
+            self.log.debug(
+                f"Returning from restart-wait with kernel.restarting value: {kernel.restarting} for kernel '{kernel_id}'."
+            )
This can also be inferred, so I'm removing this log line completely.
if env.get("KERNEL_LAUNCH_TIMEOUT", None): | ||
self.kernel_launch_timeout = float(env.get("KERNEL_LAUNCH_TIMEOUT")) |
We can scrap the if statement...
if env.get("KERNEL_LAUNCH_TIMEOUT", None): | |
self.kernel_launch_timeout = float(env.get("KERNEL_LAUNCH_TIMEOUT")) | |
self.kernel_launch_timeout = float(env.get("KERNEL_LAUNCH_TIMEOUT", default_kernel_launch_timeout)) |
At L398, I have already initialised this with the default value, so I skipped setting it again. Let me know if we still need to add this here?
I understand. It just seems cleaner and more maintainable to not have the extra if statement IMO.
Okay, will make the change.
Gentle ping regarding the suggested change.
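A sketch of the suggested simplification, combined with the module-level default defined later in this diff. The helper name resolve_launch_timeout is hypothetical and only illustrates the single env lookup replacing the if guard.

    import os

    # Module-level default, overridable process-wide via EG_KERNEL_LAUNCH_TIMEOUT (per this PR).
    default_kernel_launch_timeout = float(os.getenv("EG_KERNEL_LAUNCH_TIMEOUT", "30"))

    def resolve_launch_timeout(env: dict) -> float:
        # One lookup with a default replaces the `if env.get(...)` guard and
        # still tolerates float values such as "45.5".
        return float(env.get("KERNEL_LAUNCH_TIMEOUT", default_kernel_launch_timeout))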
        # this is added for only auto-restarter as it directly call this method.
        self.log.info(f"restarting kernel with value for now: {now}")
        if now:
            self.restarting = True
I don't think this logging is that helpful. When possible, I'd rather have the comment specific to the action. Where it is makes it tough for someone reading the code to know what "this is added" refers to. They might think it's the larger block of code.
-        # this is added for only auto-restarter as it directly call this method.
-        self.log.info(f"restarting kernel with value for now: {now}")
-        if now:
-            self.restarting = True
+        if now:  # if auto-restarting (when now is True), indicate we're restarting.
+            self.restarting = True
Yes, I agree Kevin.
Getting closer. One typo and a couple of suggestions.
@@ -18,6 +19,9 @@
from ..processproxies.processproxy import LocalProcessProxy, RemoteProcessProxy
from ..sessions.kernelsessionmanager import KernelSessionManager

default_kernel_launch_timeout = float(os.getenv("EG_KERNEL_LAUNCH_TIMEOUT", "30"))
kernel_restart_finish_poll_internal = float(os.getenv("EG_RESTART_FINISH_POLL_INTERVAL", 1.0))
typo
-kernel_restart_finish_poll_internal = float(os.getenv("EG_RESTART_FINISH_POLL_INTERVAL", 1.0))
+kernel_restart_finish_poll_interval = float(os.getenv("EG_RESTART_FINISH_POLL_INTERVAL", 1.0))
        try:
            await super().shutdown_kernel(kernel_id, now, restart)
        except KeyError as ke:  # this is hit for multiple shutdown request.
            self.log.exception(f"Exception while shutting Kernel: {kernel_id}: {ke}")
Could we please update the text of the log statement to the suggested value? It doesn't adhere to the conventions of the others (lower-case kernel, quoted kernel_id, and adds 'down' to complete the action).
if env.get("KERNEL_LAUNCH_TIMEOUT", None): | ||
self.kernel_launch_timeout = float(env.get("KERNEL_LAUNCH_TIMEOUT")) |
Gentle ping regarding the suggested change.
Force-pushed f3e7b2c to 84127bb
for more information, see https://pre-commit.ci
Looks good @rahul26goyal - thank you!
Could you please remove "(WIP)" from the title once you feel this is ready (which I'm assuming is the case)?
hi @kevin-bates
Great - thanks for the response.
Description
This PR is raised with respect to the fix proposed in #1051: wait on the kernel shutdown request until the kernel restart is completed.
How does this prevent the FD leak mentioned in #1051?
… kernel_info … is minimal.

TESTING
I have done some basic sanity testing with Notebook and JupyterLab and it looks to be working fine. Tested the changes on a local Python kernel and the changes seem to be working as expected.
- When a restart request is followed by a shutdown, the shutdown request waits until the restart is completed and then shuts down the kernel cleanly.
- When a restart request is followed by a shutdown and the shutdown times out waiting, it proceeds to a forceful shutdown. This was the same behaviour before the current changes were made.
- When multiple restart requests are received for the same kernel, the first request performs the actual restart while all the other duplicate requests wait for kernel.restarting to become False and then return.

Thanks!
Test scenario: the kernel is restarted via the "auto restarter", and then a kernel restart request is sent.
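A rough sketch of how the restart-then-shutdown race described above could be exercised in a test. This is illustrative only: the kernel_manager fixture is hypothetical, and pytest-asyncio is assumed to be available.

    import asyncio
    import pytest

    @pytest.mark.asyncio  # requires pytest-asyncio
    async def test_shutdown_waits_for_restart(kernel_manager):  # hypothetical fixture
        kernel_id = await kernel_manager.start_kernel(kernel_name="python3")

        # Fire a restart and, without waiting for it, immediately request shutdown.
        restart_task = asyncio.ensure_future(kernel_manager.restart_kernel(kernel_id))
        await kernel_manager.shutdown_kernel(kernel_id)

        # The shutdown should have waited for kernel.restarting to become False,
        # so the kernel is gone afterwards and no file descriptors are left behind.
        assert kernel_id not in kernel_manager.list_kernel_ids()

        if not restart_task.done():
            restart_task.cancel()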