
Question about High Availability for JEG on k8s #1156

Open
chiawchen opened this issue Sep 15, 2022 · 3 comments

@chiawchen (Contributor) commented Sep 15, 2022

Description

Whenever K8s tries to terminate a pod, the application receives a SIGTERM signal [reference] and should ideally shut down gracefully; however, I found the line here in JEG,

and it will trigger a shutdown of all existing kernels, so existing kernel information is eliminated even if we have external webhook kernel session persistence configured [reference in the JEG docs]. Did I miss anything about handling restarts that happen on the server side? This may happen quite frequently, e.g. when upgrading a sidecar, changing some JEG configuration, or even simply updating the hardcoded kernelspec.

Reproduce

  1. Deploy JEG as a k8s service with Replication availability & Webhook Kernel Session Persistence
  2. Connect to it through JupyterLab and create an arbitrary remote kernel
  3. Delete one of the JEG replicas via kubectl delete pod <pod_name>
  4. Observe that the remote kernel is deleted instead of being preserved for later re-connection

Expected behavior

JEG shouldn't shut down the remote kernels, only the local kernels running on the JEG pod itself (since those processes can't be recovered anyway).

Context

  • Operating System and version: Kubernetes v1.18
  • Browser and version: N/A
  • Jupyter Server version: 1.18.1
  • Jupyter Enterprise Gateway: v3.0.0dev
chiawchen added the bug label Sep 15, 2022
@kevin-bates (Member) commented

Hi @chiawchen - yeah, the HA/DR machinery has not been fully resolved. It is primarily intended for hard failures, behaving more like SIGKILL than SIGTERM, where remote kernels are orphaned.

It makes sense to make the automatic kernel shutdown sensitive to failover configuration, although I wonder if it should be an explicit option (so that we don't always orphan remote kernels), at least for now. Perhaps something like terminate_kernels_on_shutdown that defaults to True and must be explicitly set to False. Operators in configurations that need to perform periodic upgrades would then want to set this. If we find the machinery to be solid, we could then tie this option to the HA modes.
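
A minimal sketch of how such an option could be expressed as a traitlets configurable (the option name comes from the proposal above; the class name and placement are purely illustrative, not EG's actual code):

```python
from traitlets import Bool
from traitlets.config import Configurable


class GatewayShutdownOptions(Configurable):
    """Illustrative holder for the proposed option; actual placement TBD."""

    # Defaulting to True preserves today's behavior (all managed kernels are
    # shut down with the gateway). Operators doing planned rolling upgrades in
    # an HA setup would explicitly set this to False so remote kernels survive
    # and can be re-hydrated by another EG instance.
    terminate_kernels_on_shutdown = Bool(
        True,
        config=True,
        help="Shut down all managed kernels when the gateway itself shuts down.",
    )
```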

Also note that we now support terminationGracePeriodSeconds in the helm chart.

@chiawchen (Contributor, Author) commented Sep 16, 2022

avoiding orphan remote kernels

Makes sense for the general use case. To prevent this, I think the operator side needs some auto-GC enabled as the final guard (e.g. delete any remote kernel pod older than one week).
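
A rough sketch of what that final-guard GC could look like as an operator-side job (the namespace and label selector here are assumptions about how kernel pods are labelled, not something JEG mandates):

```python
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

MAX_AGE = timedelta(days=7)          # e.g. reclaim kernel pods older than one week
NAMESPACE = "enterprise-gateway"     # assumption: namespace where kernel pods run
LABEL_SELECTOR = "component=kernel"  # assumption: label identifying kernel pods


def gc_old_kernel_pods():
    """Delete kernel pods older than MAX_AGE (a final guard, not the primary cleanup)."""
    config.load_incluster_config()   # or config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    now = datetime.now(timezone.utc)
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
    for pod in pods.items:
        if now - pod.metadata.creation_timestamp > MAX_AGE:
            v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)


if __name__ == "__main__":
    gc_old_kernel_pods()
```

This could run as a Kubernetes CronJob so the cleanup happens on a schedule without operator intervention.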

@kevin-bates (Member) commented

avoiding orphan remote kernels

Makes sense for the general use case. To prevent this, I think the operator side needs some auto-GC enabled as the final guard (e.g. delete any remote kernel pod older than one week).

Later last night I realized that, so long as there's another EG instance running at the time the first gets shut down (or even sometime later), and that "other instance" shares the same kernel persistence store (which is assumed in HA configs), then the only kernel pods to be orphaned would be those with which a user never interacts following the stopped EG's shutdown. That is, all other kernel pods should become active again by virtue of the "hydration" that occurs when a user interacts with their kernel via interrupt, reconnect, etc.
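
Conceptually (a simplified sketch, not EG's actual implementation), that hydration path amounts to something like:

```python
async def get_or_hydrate_kernel(kernel_manager, persistence_store, kernel_id):
    """Return a managed kernel, re-loading it from the persistence store if this
    EG instance doesn't currently own it (illustration only)."""
    if kernel_id in kernel_manager.list_kernel_ids():
        return kernel_manager.get_kernel(kernel_id)
    # The kernel was last managed by a now-stopped (or different) EG instance:
    # pull its persisted session info and re-establish management of the
    # still-running remote kernel pod.
    session_info = persistence_store.load(kernel_id)                      # hypothetical accessor
    return await kernel_manager.start_kernel_from_session(session_info)  # hypothetical method
```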

But, yes, we've talked about introducing some admin-related endpoints - one of which could interrogate the kernel persistence store, compare that with the set of managed kernels (somehow checking with each EG instance), and present a list of currently unmanaged kernels. On Kubernetes, this application could present some of the labels, envs, etc. that reside on the kernel pod to help operators better understand whether they should be hydrated or terminated.
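
As a very rough illustration, the core of such an endpoint would just be a diff between the persistence store and the set of currently managed kernels (both inputs here are hypothetical stand-ins for whatever the store and EG instances actually expose):

```python
def find_unmanaged_kernels(persisted_sessions: dict, managed_kernel_ids: set) -> list:
    """Return persisted kernel records that no running EG instance currently manages.

    persisted_sessions maps kernel_id -> persisted session record (e.g. from the
    webhook persistence store); managed_kernel_ids is the union of kernel ids
    owned by all running EG instances.
    """
    return [
        record
        for kernel_id, record in persisted_sessions.items()
        if kernel_id not in managed_kernel_ids
    ]
```

On Kubernetes, each returned record (or the corresponding pod's labels and envs) could then be surfaced to the operator to decide between hydration and termination.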

This leads me to wonder if kernel provisioners (and perhaps the older, to-be-obsoleted process proxies) should expose a method allowing users to access their "metadata" given a kernel_id (or whatever else is necessary to locate the kernel).
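
Purely as a speculative illustration, such a method might look like this on a provisioner (the class, name, and signature are invented for this sketch):

```python
from typing import Any, Dict


class SomeRemoteProvisioner:  # illustrative stand-in for a kernel provisioner class
    async def get_kernel_metadata(self, kernel_id: str) -> Dict[str, Any]:
        """Return location metadata for the kernel identified by kernel_id
        (e.g. pod name, namespace, labels, envs) without requiring a live
        connection to the kernel itself."""
        raise NotImplementedError  # each provisioner would supply its own lookup
```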
