
Need ability to identify dormant kernels to quiesce them #96

Closed
enigmata opened this issue Feb 8, 2016 · 12 comments

@enigmata

enigmata commented Feb 8, 2016

In the context of a service that needs to efficiently manage jupyter-kernel-gateways for users, it is necessary to monitor kernels for a sufficient period of inactivity, so that inactive kernels can be shut down.

In the bigger picture, where a kernel-gateway is provisioned for each user, it is necessary to quiesce that user's kernel-gateway when they've "stepped away" from the API and left their kernels inactive for an extended period of time; the gateway can be brought back up when they return.

Since kernels can drive work outside of jupyter/gateway, there's more involved than just monitoring the kernel-gateway. But for the kernel-gateway itself, we must be able to monitor the last real activity in a kernel, e.g. a timestamp. So, it would help to have an API that returns, for each provisioned kernel, the last time there was "activity" through it. I'm not sure about the best way to handle the case of no kernels: if nothing is returned, there's no way to determine how long it has been since kernels were running, unless that state is tracked outside the gateway.

@parente
Contributor

parente commented Feb 8, 2016

Thanks for opening this @rwhorman. The request is reasonable from an admin/devops perspective. The challenge will be getting the hooks into the right spots to watch for traffic to/from kernels via the websocket connections, including idle/busy status indications. This has partially been done with jupyter/http-configurable-proxy in the past because adding the capability to the notebook server itself is a bit out of place. Here in the kernel gateway, though, which is meant to be a programmatic API for kernels, it makes more sense IMHO.

We'll give it a shot.

Strawman API in jupyter-websocket mode: /_api/activity, protected by an admin token, which returns a JSON response like so:

{
  "0f41b09c-c5b4-4f28-a7db-2f779151c20f": {
    "last_activity": "2014-09-08T19:40:17.819Z",
    "busy" : false
  }
}

where:

  • the top-level key is the kernel GUID
  • last_activity is the last known ISO-8601 date/time at which traffic left or entered the kernel
  • busy indicates whether the kernel is busy, based on its last iopub idle/busy status message

Alternative: Add the metadata right into the /api/kernels[/:id] response, but then we're extending the API defined by the notebook server.
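
For illustration, a minimal client-side sketch (in Python) of how an admin process might poll the proposed endpoint and find quiet kernels; the gateway URL, token header, and idle threshold are assumptions layered on the strawman above, not a finalized API:

# Hypothetical polling sketch against the proposed /_api/activity endpoint.
# The endpoint path and response shape follow the strawman above.
from datetime import datetime, timedelta, timezone

import requests

GATEWAY_URL = "http://localhost:8888"   # assumed gateway location
ADMIN_TOKEN = "secret-admin-token"      # assumed admin token
IDLE_LIMIT = timedelta(hours=12)        # example inactivity threshold


def idle_kernels():
    """Return ids of kernels that are not busy and have been idle longer than IDLE_LIMIT."""
    resp = requests.get(
        GATEWAY_URL + "/_api/activity",
        headers={"Authorization": "token " + ADMIN_TOKEN},
    )
    resp.raise_for_status()
    now = datetime.now(timezone.utc)
    stale = []
    for kernel_id, info in resp.json().items():
        last = datetime.strptime(
            info["last_activity"], "%Y-%m-%dT%H:%M:%S.%fZ"
        ).replace(tzinfo=timezone.utc)
        if not info["busy"] and now - last > IDLE_LIMIT:
            stale.append(kernel_id)
    return stale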

@parente
Contributor

parente commented Feb 8, 2016

Probably the place to try would be to write a new WebsocketHandler override in the https://github.com/jupyter-incubator/kernel_gateway/blob/master/kernel_gateway/services/kernels/handlers.py module. In the new class, override on_message to watch traffic from a client to a kernel and _on_zmq_reply to watch traffic from a kernel to a client.

Define a new ActivityManager class, instantiate it in the gateway app, and pass it via settings so it is available to both the new activity handler for /_api/activity and the websocket kernel handler override.
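
For concreteness, a rough sketch of that shape, assuming the notebook's ZMQChannelsHandler as the base class; the class names, the activity_manager settings key, and the exact method signatures are assumptions and may differ across notebook versions:

# Hypothetical override recording traffic timestamps; names are illustrative only.
from datetime import datetime

from notebook.services.kernels.handlers import ZMQChannelsHandler


class ActivityManager(object):
    """In-memory map of kernel id -> last activity timestamp."""

    def __init__(self):
        self.values = {}

    def touch(self, kernel_id):
        self.values[kernel_id] = datetime.utcnow()

    def remove(self, kernel_id):
        self.values.pop(kernel_id, None)


class ActivityRecordingChannelsHandler(ZMQChannelsHandler):
    """Stamps activity as websocket traffic passes in either direction."""

    def on_message(self, msg):
        # Client -> kernel traffic.
        self.settings['activity_manager'].touch(self.kernel_id)
        super(ActivityRecordingChannelsHandler, self).on_message(msg)

    def _on_zmq_reply(self, stream, msg_list):
        # Kernel -> client traffic, including iopub idle/busy status messages.
        self.settings['activity_manager'].touch(self.kernel_id)
        super(ActivityRecordingChannelsHandler, self)._on_zmq_reply(stream, msg_list)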

@lbustelo
Contributor

lbustelo commented Feb 9, 2016

@rwhorman Any concerns that the KG will hold this activity information in memory? Are you shutting down KGs independently of kernels?

Some interesting scenarios:

  1. User executes some code on the kernel that leaves a thread pushing data back through the iopub channel. User then leaves. Is this a case where we would like to terminate the KG? If so, we should not track traffic that left or entered the kernel, but rather just track the idle/busy state changes.
  2. User executes some long-running code on the kernel. User then leaves. The kernel finishes in the middle of the night. User comes back later the next day. How can we identify this pattern so that the user can return and view results? Are you thinking of bringing down the KG while a long-running job is taking place in the kernel? Maybe not a big issue if you are managing kernels separately from the KG.

How about this structure below:

{
  "0f41b09c-c5b4-4f28-a7db-2f779151c20f": {
    "last_time_state_changed": "2014-09-08T19:40:17.819Z",
    "busy" : false
  }
}

where:

  • last_time_state_changed is the last time the state changed from idle to busy or vice versa.
  • busy still would need to track the current state.

In the above scenarios, when I refer to "User then leaves" I'm assuming that they just left the computer, but still have a notebook or application running.

@Lull3rSkat3r
Collaborator

Something that may help users decide on their specific use case is to provide more information:

{
  "0f41b09c-c5b4-4f28-a7db-2f779151c20f": {
    "last_message_to_client": "2014-09-08T19:40:17.819Z",
    "last_message_to_kernel": "2014-09-08T19:40:17.819Z",
    "last_time_state_changed": "2014-09-08T19:40:17.819Z",
    "busy" : false,
    "connections" : 0,
    "last_client_connect": "2014-09-08T19:40:17.819Z",
    "last_client_disconnect": "2014-09-08T19:40:17.819Z"
  }
}
  • last_message_to_client - The last time a message was sent from the kernel to a client
  • last_message_to_kernel - The last time a client sent a message to the kernel
  • last_client_connect - The last time a client websocket connected to /api/kernels/:kernel_id/channels
  • last_client_disconnect - The last time a client websocket disconnected from /api/kernels/:kernel_id/channels
  • connections - The number of websocket connections to the kernel gateway
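
For illustration, a sketch of how those per-kernel fields might be kept in memory and updated from websocket open/close hooks; the ConnectionTracker name and its methods are hypothetical:

# Hypothetical in-memory store for the per-kernel fields proposed above.
from datetime import datetime


def default_record():
    return {
        'last_message_to_client': None,
        'last_message_to_kernel': None,
        'last_time_state_changed': None,
        'busy': False,
        'connections': 0,
        'last_client_connect': None,
        'last_client_disconnect': None,
    }


class ConnectionTracker(object):
    def __init__(self):
        self.kernels = {}

    def client_connected(self, kernel_id):
        # Called when a websocket opens on /api/kernels/:kernel_id/channels.
        record = self.kernels.setdefault(kernel_id, default_record())
        record['connections'] += 1
        record['last_client_connect'] = datetime.utcnow()

    def client_disconnected(self, kernel_id):
        # Called when that websocket closes.
        record = self.kernels.setdefault(kernel_id, default_record())
        record['connections'] = max(0, record['connections'] - 1)
        record['last_client_disconnect'] = datetime.utcnow()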

@Lull3rSkat3r
Collaborator

#97 is a WIP, but a base implementation is there for /_api/activity. When we finalize what we want returned, we can easily add those values.

@parente
Contributor

parente commented Feb 9, 2016

Nice. Glad it wasn't overly complicated. Some tests will help too once the values are defined.

@Lull3rSkat3r
Collaborator

@parente, yes on the tests, just waiting for us to finalize what we want.

@enigmata
Author

@Lull3rSkat3r I like what you did there w/ the additional info. I can imagine ways to utilize the info to smart-track, and potentially optimize the deactivation logic. ... and I can imagine how we may consider feeding some of this back to the user to explain why deactivation happened; you know, those developers as end users ;-)

@enigmata
Author

@lbustelo In-memory is cool. We only care while the KG is up, and it comes down when we say so, i.e. when kernels are all gone or we deem them to be sufficiently inactive.

scenario 1: No, we wouldn't terminate a kernel if it is pushing data through. A ticking meter is a happy meter in the cloud services world ;-)

scenario 2: I view this as falling under best practices, in that long-running jobs need to write results to a storage service that the user can subsequently reconnect to and query. Another relevant aspect here is that the deactivation timeout can be tuned so as to catch the 80-20; for overnight use cases, we can set it to 12 hrs or so ... heck, even 24 hrs is not a bad TTL. What we wouldn't want is somebody sitting on an idle KG and kernels for multiple days.
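
As a sketch of that policy, assuming the /_api/activity response shape discussed above (the TTL value and the should_quiesce helper are illustrative, not part of the API):

# Hypothetical quiesce decision over a parsed /_api/activity response.
from datetime import datetime, timedelta, timezone

TTL = timedelta(hours=24)  # example "overnight-safe" deactivation timeout


def should_quiesce(activity, now=None):
    """Return True if the gateway can be brought down: no kernels remain, or
    every kernel is idle with its last activity older than TTL."""
    now = now or datetime.now(timezone.utc)
    if not activity:
        # No kernels reported; the caller has to track elapsed time itself.
        return True
    for info in activity.values():
        last = datetime.strptime(
            info["last_activity"], "%Y-%m-%dT%H:%M:%S.%fZ"
        ).replace(tzinfo=timezone.utc)
        if info["busy"] or now - last < TTL:
            return False
    return True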

@lbustelo
Contributor

@rwhorman sounds good to me.

@Lull3rSkat3r
Collaborator

PR #97 has all the fields from my comment and @lbustelo's. @rwhorman, let me know if this works for you.

@parente parente modified the milestone: 0.4.0 Feb 12, 2016
@enigmata
Author

Looks good to me @Lull3rSkat3r! Much appreciate the very quick turnaround!
