
Need ability to identify dormant kernels to quiesce them #96

Closed
enigmata opened this issue Feb 8, 2016 · 12 comments

@enigmata

enigmata commented Feb 8, 2016

In the context of a service that needs to efficiently manage jupyter-kernel-gateways for users, it is necessary to monitor kernels for a sufficient period of inactivity, so that inactive kernels can be shut down.

In the bigger picture, where a kernel-gateway is provisioned for each user, it is necessary to quiesce that user's kernel-gateway when they've "stepped away" from the API and left their kernels inactive for an extended period of time; the gateway can be brought back up when they return.

Since kernels can drive work outside of jupyter/gateway, there's more involved than just monitoring the kernel-gateway. But for the kernel-gateway itself, we must be able to monitor the last real activity in a kernel, e.g. a timestamp. So, it would help to have an API that returns, for each provisioned kernel, the last time there was "activity" through it. I'm not sure about the best way to handle the case of no kernels: if nothing is returned, there's no way to determine how long it has been since kernels were running, unless that state is tracked outside the gateway.

@parente
Contributor

parente commented Feb 8, 2016

Thanks for opening this @rwhorman. The request is reasonable from an admin/devops perspective. The challenge will be getting the hooks into the right spots to watch for traffic to/from kernels via the websocket connections, including idle/busy status indications. This has partially been done with jupyter/http-configurable-proxy in the past because adding the capability to the notebook server itself is a bit out of place. Here in the kernel gateway, though, which is meant to be a programmatic API for kernels, it makes more sense IMHO.

We'll give it a shot.

Strawman API in jupyter-websocket mode: /_api/activity, protected by an admin token, which returns a JSON response like so:

{
  "0f41b09c-c5b4-4f28-a7db-2f779151c20f": {
    "last_activity": "2014-09-08T19:40:17.819Z",
    "busy" : false
  }
}

where:

  • the top-level key is the kernel GUID
  • last_activity is the last known ISO-8601 date/time at which traffic left or entered the kernel
  • busy indicates whether the kernel is busy, based on its last iopub idle/busy status message

Alternative: Add the metadata right into the /api/kernels[/:id] response, but then we're extending the API defined by the notebook server.
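
For illustration, a minimal client-side sketch (in Python) of how an admin process might poll the proposed endpoint and find quiet kernels; the gateway URL, token header, and idle threshold are assumptions layered on the strawman above, not a finalized API:

# Hypothetical polling sketch against the proposed /_api/activity endpoint.
# The endpoint path and response shape follow the strawman above.
from datetime import datetime, timedelta, timezone

import requests

GATEWAY_URL = "http://localhost:8888"   # assumed gateway location
ADMIN_TOKEN = "secret-admin-token"      # assumed admin token
IDLE_LIMIT = timedelta(hours=12)        # example inactivity threshold


def idle_kernels():
    """Return ids of kernels that are not busy and have been idle longer than IDLE_LIMIT."""
    resp = requests.get(
        GATEWAY_URL + "/_api/activity",
        headers={"Authorization": "token " + ADMIN_TOKEN},
    )
    resp.raise_for_status()
    now = datetime.now(timezone.utc)
    stale = []
    for kernel_id, info in resp.json().items():
        last = datetime.strptime(
            info["last_activity"], "%Y-%m-%dT%H:%M:%S.%fZ"
        ).replace(tzinfo=timezone.utc)
        if not info["busy"] and now - last > IDLE_LIMIT:
            stale.append(kernel_id)
    return stale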

@parente
Contributor

parente commented Feb 8, 2016

Probably the place to try would be to write a new WebsocketHandler override in the https://github.com/jupyter-incubator/kernel_gateway/blob/master/kernel_gateway/services/kernels/handlers.py module. In the new class, override on_message to watch traffic from a client to a kernel and _on_zmq_reply to watch traffic from a kernel to a client.

Define a new ActivityManager class, instantiate it in the gateway app, and pass it via settings so it is available to both the new activity handler for /_api/activity and the websocket kernel handler override.
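
For concreteness, a rough sketch of that shape, assuming the notebook's ZMQChannelsHandler as the base class; the class names, the activity_manager settings key, and the exact method signatures are assumptions and may differ across notebook versions:

# Hypothetical override recording traffic timestamps; names are illustrative only.
from datetime import datetime

from notebook.services.kernels.handlers import ZMQChannelsHandler


class ActivityManager(object):
    """In-memory map of kernel id -> last activity timestamp."""

    def __init__(self):
        self.values = {}

    def touch(self, kernel_id):
        self.values[kernel_id] = datetime.utcnow()

    def remove(self, kernel_id):
        self.values.pop(kernel_id, None)


class ActivityRecordingChannelsHandler(ZMQChannelsHandler):
    """Stamps activity as websocket traffic passes in either direction."""

    def on_message(self, msg):
        # Client -> kernel traffic.
        self.settings['activity_manager'].touch(self.kernel_id)
        super(ActivityRecordingChannelsHandler, self).on_message(msg)

    def _on_zmq_reply(self, stream, msg_list):
        # Kernel -> client traffic, including iopub idle/busy status messages.
        self.settings['activity_manager'].touch(self.kernel_id)
        super(ActivityRecordingChannelsHandler, self)._on_zmq_reply(stream, msg_list)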

@lbustelo
Contributor

lbustelo commented Feb 9, 2016

@rwhorman Any concerns that the KG will hold this activity information in memory? Are you shutting down KGs independently of kernels?

Some interesting scenarios:

  1. User executes some code on the kernel that leaves a thread pushing data back through the iopub channel. User then leaves. Is this a case where we would like to terminate the KG? If so, we should not track traffic that left or entered the kernel, but rather just track the idle/busy state changes.
  2. User executes some long-running code on the kernel. User then leaves. The kernel finishes in the middle of the night. User comes back later the next day. How can we identify this pattern so that the user can return and view results? Are you thinking of bringing down the KG while a long-running job is taking place in the kernel? Maybe not a big issue if you are managing kernels separately from the KG.

How about this structure below:

{
  "0f41b09c-c5b4-4f28-a7db-2f779151c20f": {
    "last_time_state_changed": "2014-09-08T19:40:17.819Z",
    "busy" : false
  }
}

where:

  • last_time_state_changed is the last time the state changed from idle to busy or vice versa.
  • busy still would need to track the current state.

In the above scenarios, when I refer to "User then leaves" I'm assuming that they just left the computer, but still have a notebook or application running.

@Lull3rSkat3r
Collaborator

Something that may help users decide on their specific use case is to provide more information:

{
  "0f41b09c-c5b4-4f28-a7db-2f779151c20f": {
    "last_message_to_client": "2014-09-08T19:40:17.819Z",
    "last_message_to_kernel": "2014-09-08T19:40:17.819Z",
    "last_time_state_changed": "2014-09-08T19:40:17.819Z",
    "busy" : false,
    "connections" : 0,
    "last_client_connect": "2014-09-08T19:40:17.819Z",
    "last_client_disconnect": "2014-09-08T19:40:17.819Z"
  }
}
  • last_message_to_client - The last time a message was sent from the kernel to a client
  • last_message_to_kernel - The last time a client sent a message to the kernel
  • last_client_connect - The last time a client websocket connected to /api/kernels/:kernel_id/channels
  • last_client_disconnect - The last time a client websocket disconnected from /api/kernels/:kernel_id/channels
  • connections - The number of websocket connections to the kernel gateway
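
For illustration, a sketch of how those per-kernel fields might be kept in memory and updated from websocket open/close hooks; the ConnectionTracker name and its methods are hypothetical:

# Hypothetical in-memory store for the per-kernel fields proposed above.
from datetime import datetime


def default_record():
    return {
        'last_message_to_client': None,
        'last_message_to_kernel': None,
        'last_time_state_changed': None,
        'busy': False,
        'connections': 0,
        'last_client_connect': None,
        'last_client_disconnect': None,
    }


class ConnectionTracker(object):
    def __init__(self):
        self.kernels = {}

    def client_connected(self, kernel_id):
        # Called when a websocket opens on /api/kernels/:kernel_id/channels.
        record = self.kernels.setdefault(kernel_id, default_record())
        record['connections'] += 1
        record['last_client_connect'] = datetime.utcnow()

    def client_disconnected(self, kernel_id):
        # Called when that websocket closes.
        record = self.kernels.setdefault(kernel_id, default_record())
        record['connections'] = max(0, record['connections'] - 1)
        record['last_client_disconnect'] = datetime.utcnow()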

@Lull3rSkat3r
Collaborator

#97 is a WIP, but a base implementation is there for /_api/activity. When we finalize what we want returned, we can easily add those values.

@parente
Contributor

parente commented Feb 9, 2016

Nice. Glad it wasn't overly complicated. Some tests will help too once the values are defined.

@Lull3rSkat3r
Collaborator

@parente, yes on the tests, just waiting for us to finalize what we want.

@enigmata
Author

@Lull3rSkat3r I like what you did there w/ the additional info. I can imagine ways to utilize the info to smart-track, and potentially optimize the deactivation logic. ... and I can imagine how we may consider feeding some of this back to the user to explain why deactivation happened; you know, those developers as end users ;-)

@enigmata
Author

@lbustelo In-memory is cool. We only care while the KG is up, and it comes down when we say so, i.e. when kernels are all gone or we deem them to be sufficiently inactive.

scenario 1: No, we wouldn't terminate a kernel if it is pushing data through. A ticking meter is a happy meter in the cloud services world ;-)

scenario 2: I view this as falling under best practices, in that long-running jobs need to write results to a storage service that the user can subsequently reconnect to and query. Another relevant aspect here is that the deactivation timeout can be tuned so as to catch the 80-20; for overnight use cases, we can set it to 12 hrs or so ... heck, even 24 hrs is not a bad TTL. What we wouldn't want is somebody sitting on an idle KG and kernels for multiple days.
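
As a sketch of that policy, assuming the /_api/activity response shape discussed above (the TTL value and the should_quiesce helper are illustrative, not part of the API):

# Hypothetical quiesce decision over a parsed /_api/activity response.
from datetime import datetime, timedelta, timezone

TTL = timedelta(hours=24)  # example "overnight-safe" deactivation timeout


def should_quiesce(activity, now=None):
    """Return True if the gateway can be brought down: no kernels remain, or
    every kernel is idle with its last activity older than TTL."""
    now = now or datetime.now(timezone.utc)
    if not activity:
        # No kernels reported; the caller has to track elapsed time itself.
        return True
    for info in activity.values():
        last = datetime.strptime(
            info["last_activity"], "%Y-%m-%dT%H:%M:%S.%fZ"
        ).replace(tzinfo=timezone.utc)
        if info["busy"] or now - last < TTL:
            return False
    return True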

@lbustelo
Contributor

@rwhorman sounds good to me.

@Lull3rSkat3r
Collaborator

PR #97 has all the fields from my comment and @lbustelo's. @rwhorman, let me know if this works for you.

@parente parente modified the milestone: 0.4.0 Feb 12, 2016
@enigmata
Author

Looks good to me @Lull3rSkat3r! Much appreciate the very quick turnaround!
