Availability modes #1095

Merged
merged 8 commits into from
Jun 27, 2022
149 changes: 149 additions & 0 deletions docs/source/operators/config-availability.md
@@ -0,0 +1,149 @@
# Availability modes

Enterprise Gateway can be optionally configured in one of two "availability modes": _standalone_ or _replication_. When configured, Enterprise Gateway can recover from failures and reconnect to any active remote kernels that were previously managed by the terminated EG instance. As such, both modes require that kernel session persistence also be enabled via `KernelSessionManager.enable_persistence=True`.

```{note}
Kernel session persistence will be automatically enabled whenever an availability mode is configured.
```

```{caution}
**Availability modes and kernel session persistence should be considered experimental!**

Known issues include:
1. Culling configurations do not account for different nodes and therefore could result in the incorrect culling of kernels.
2. Each "node switch" requires a manual reconnect to the kernel.

Contributor

Are the above issues only with "active-active" mode and not with "active-passive" mode of EG?

Member Author

@kevin-bates Jun 1, 2022

I think reconnecting is necessary for both forms.

Even with "active-active", because we still expect/advise affinity with the managed kernel, you shouldn't run into an issue where the kernel is culled prematurely because it should always stay on the originating node. Only if the affinity is not configured (or not working) could the kernel be culled prematurely from the previous node.

I'll look into some better wording for this, but we should probably better understand where things are with this before merging. Thanks for this comment.

We hope to address these in future releases (depending on demand).
```

## Standalone availability

_Standalone availability_ assumes that, upon failure of the original EG instance, another EG instance will be started. Upon startup of the second instance (following the termination of the first), EG will attempt to load and reconnect to all kernels that were deemed active when the previous instance terminated. This mode is somewhat analogous to the classic HA/DR mode of _active-passive_ and is typically used when node resources are at a premium or the number of replicas (in the Kubernetes sense) must remain at 1.

To enable Enterprise Gateway for 'standalone' availability, configure `EnterpriseGatewayApp.availability_mode=standalone` or set the environment variable `EG_AVAILABILITY_MODE=standalone`.

Here's an example for starting Enterprise Gateway with standalone availability:

```bash
#!/bin/bash

LOG=/var/log/enterprise_gateway.log
PIDFILE=/var/run/enterprise_gateway.pid

jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --log-level=DEBUG \
  --EnterpriseGatewayApp.availability_mode=standalone > $LOG 2>&1 &

if [ "$?" -eq 0 ]; then
  echo $! > $PIDFILE
else
  exit 1
fi
```

## Replication availability

With _replication availability_, multiple EG instances (or replicas) operate at the same time, fronted by a reverse proxy or load balancer. Because state still resides within each `KernelManager` instance executing within a given EG instance, we strongly suggest configuring some form of _client affinity_ (a.k.a. "sticky sessions") to avoid node switches wherever possible, since each node switch currently requires a manual reconnection of the front-end.

```{tip}
Configuring client affinity is **strongly recommended**, otherwise functionality that relies on state within the servicing node (e.g., culling) can be affected upon node switches, resulting in incorrect behavior.
```

In this mode, when one node goes down, the subsequent request will be routed to a different node that doesn't know about the kernel. Prior to returning a `404` (not found) status code, EG will check its persisted store to determine if the kernel was managed and, if so, attempt to "hydrate" a `KernelManager` instance associated with the remote kernel. (Of course, if the kernel was running local to the downed server, chances are it cannot be _revived_.) Upon successful "hydration" the request continues as if on the originating node. Because _client affinity_ is in place, subsequent requests should continue to be routed to the "servicing node".
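The lookup-then-hydrate flow described above can be sketched as follows. This is a simplified illustration, not Enterprise Gateway's implementation: `lookup_kernel`, `KernelNotFound`, and the dictionary-based stores are hypothetical names chosen for the example, and `hydrate` stands in for whatever logic rebuilds a `KernelManager` from persisted session data.

```python
from typing import Callable, Dict, Optional


class KernelNotFound(Exception):
    """Stands in for the HTTP 404 the gateway would return (hypothetical)."""


def lookup_kernel(
    kernel_id: str,
    local_managers: Dict[str, object],
    persisted_sessions: Dict[str, dict],
    hydrate: Callable[[dict], Optional[object]],
) -> object:
    """Return a manager for kernel_id, hydrating from the persisted store if needed."""
    # Fast path: this node already manages the kernel.
    if kernel_id in local_managers:
        return local_managers[kernel_id]

    # Slow path: another node may have persisted this kernel's session.
    session = persisted_sessions.get(kernel_id)
    if session is not None:
        manager = hydrate(session)
        if manager is not None:
            # Cache locally so later (affinity-routed) requests take the fast path.
            local_managers[kernel_id] = manager
            return manager

    # Unknown kernel, or hydration failed (e.g., the kernel ran local to the downed node).
    raise KernelNotFound(kernel_id)
```

Caching the hydrated manager is what makes client affinity pay off: once a request has landed on the new servicing node, subsequent requests no longer consult the persisted store.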

To enable Enterprise Gateway for 'replication' availability, configure `EnterpriseGatewayApp.availability_mode=replication` or set the environment variable `EG_AVAILABILITY_MODE=replication`.

```{attention}
To preserve backwards compatibility, if only kernel session persistence is enabled via `KernelSessionManager.enable_persistence=True`, the availability mode will be automatically configured to 'replication' if `EnterpriseGatewayApp.availability_mode` is not configured.
```
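The way the two settings reconcile at startup can be summarized as a small pure function. This sketch mirrors the behavior described in this document; `resolve_availability` is a hypothetical helper written for illustration and is not part of Enterprise Gateway's API.

```python
from typing import Optional, Tuple

AVAILABILITY_STANDALONE = "standalone"
AVAILABILITY_REPLICATION = "replication"


def resolve_availability(
    enable_persistence: bool, availability_mode: Optional[str]
) -> Tuple[bool, Optional[str]]:
    """Return the effective (enable_persistence, availability_mode) pair."""
    if enable_persistence:
        # Backwards compatibility: persistence alone implies 'replication'.
        if availability_mode is None:
            availability_mode = AVAILABILITY_REPLICATION
    elif availability_mode is not None:
        # Any availability mode requires persistence, so enable it.
        enable_persistence = True
    return enable_persistence, availability_mode


# Example: persistence enabled with no mode configured.
print(resolve_availability(True, None))
```

An explicitly configured mode is never overridden; only the unset value is defaulted.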

Here's an example for starting Enterprise Gateway with replication availability:

```bash
#!/bin/bash

LOG=/var/log/enterprise_gateway.log
PIDFILE=/var/run/enterprise_gateway.pid

jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --log-level=DEBUG \
  --EnterpriseGatewayApp.availability_mode=replication > $LOG 2>&1 &

if [ "$?" -eq 0 ]; then
  echo $! > $PIDFILE
else
  exit 1
fi
```

# Kernel Session Persistence

Enabling kernel session persistence allows Jupyter Notebooks to reconnect to kernels when Enterprise Gateway is restarted and forms the basis for the _availability modes_ described above. Enterprise Gateway provides two ways of persisting kernel sessions: _File Kernel Session Persistence_ and _Webhook Kernel Session Persistence_, although others can be provided by subclassing `KernelSessionManager` (see below).

```{attention}
Due to its experimental nature, kernel session persistence is disabled by default. To enable this functionality, you must configure `KernelSessionManager.enable_persistence=True` or configure `EnterpriseGatewayApp.availability_mode` to either `standalone` or `replication`.
```

As noted above, the availability modes rely on the persisted information relative to the kernel. This information consists of the arguments and options used to launch the kernel, along with its connection information. In essence, it consists of any information necessary to re-establish communication with the kernel.

## File Kernel Session Persistence

File Kernel Session Persistence stores kernel sessions as files in a specified directory. To enable this form of persistence, set the environment variable `EG_KERNEL_SESSION_PERSISTENCE=True` or configure `FileKernelSessionManager.enable_persistence=True`. To change the directory in which the kernel session file is being saved, either set the environment variable `EG_PERSISTENCE_ROOT` or configure `FileKernelSessionManager.persistence_root` to the directory. By default, the directory used to store a given kernel's session information is the `JUPYTER_DATA_DIR`.

```{note}
Because `FileKernelSessionManager` is the default class for kernel session persistence, configuring `EnterpriseGatewayApp.kernel_session_manager_class` to `enterprise_gateway.services.sessions.kernelsessionmanager.FileKernelSessionManager` is not necessary.
```
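As an alternative to command-line options and environment variables, the same settings can be placed in a traitlets configuration file. The fragment below is a sketch: the file name `jupyter_enterprise_gateway_config.py` and the directory path are assumptions for illustration, while the trait names are the ones described above.

```python
# jupyter_enterprise_gateway_config.py (file name assumed)
c = get_config()  # noqa: F821 - injected by the traitlets config loader

# Selecting an availability mode also enables kernel session persistence.
c.EnterpriseGatewayApp.availability_mode = "standalone"

# Shown for completeness; implied by the availability mode above.
c.FileKernelSessionManager.enable_persistence = True

# Hypothetical directory; defaults to JUPYTER_DATA_DIR when unset.
c.FileKernelSessionManager.persistence_root = "/var/lib/enterprise_gateway/sessions"
```

Because this fragment relies on `get_config()` being supplied by the configuration loader, it is meant to be loaded by the application rather than executed directly.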

## Webhook Kernel Session Persistence

Webhook Kernel Session Persistence stores kernel sessions in a database of your choice by way of an API that you provide. The API must include four endpoints:

- A `GET` that will retrieve a list of all kernel sessions from a database
- A `GET` that will take the kernel id as a path variable and retrieve that information from a database
- A `DELETE` that will delete all kernel sessions, where the body of the request is a list of kernel ids
- A `POST` that will take kernel id as a path variable and kernel session in the body of the request and save it to a database where the object being saved is:

```
{
kernel_id: UUID string,
kernel_session: JSON
}
```
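The four endpoints above can be sketched with an in-memory dispatcher. This is an illustration only: the `/sessions` path layout, the status codes, and the response shapes are assumptions, since the exact contract is not spelled out here, and a dictionary stands in for the database.

```python
import json
from typing import Dict, Optional, Tuple

# In-memory stand-in for the database; a real service would use persistent storage.
_sessions: Dict[str, dict] = {}


def handle(method: str, path: str, body: Optional[str] = None) -> Tuple[int, object]:
    """Dispatch one request to the four endpoints; returns (status, payload)."""
    parts = [p for p in path.split("/") if p]  # e.g. ["sessions", "<kernel_id>"]
    if not parts or parts[0] != "sessions":
        return 404, None
    kernel_id = parts[1] if len(parts) > 1 else None

    if method == "GET" and kernel_id is None:
        # List every stored kernel session.
        return 200, [{"kernel_id": k, "kernel_session": v} for k, v in _sessions.items()]
    if method == "GET":
        # Fetch one kernel session by id.
        if kernel_id not in _sessions:
            return 404, None
        return 200, {"kernel_id": kernel_id, "kernel_session": _sessions[kernel_id]}
    if method == "DELETE" and kernel_id is None:
        # The body is a JSON list of kernel ids to delete.
        for k in json.loads(body or "[]"):
            _sessions.pop(k, None)
        return 204, None
    if method == "POST" and kernel_id is not None:
        # The body is the kernel session JSON to save under this id.
        _sessions[kernel_id] = json.loads(body or "{}")
        return 204, None
    return 405, None
```

Keeping the dispatch logic separate from any web framework makes it easy to exercise the contract in tests before wiring it to a real HTTP server and database.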

To enable the webhook kernel session persistence, set the environment variable `EG_KERNEL_SESSION_PERSISTENCE=True` or configure `WebhookKernelSessionManager.enable_persistence=True`. To connect the API, set the environment variable `EG_WEBHOOK_URL` or configure `WebhookKernelSessionManager.webhook_url` to the API endpoint.

Because `WebhookKernelSessionManager` is not the default kernel session persistence class, an additional configuration step must be taken to instruct EG to use this class: `EnterpriseGatewayApp.kernel_session_manager_class = enterprise_gateway.services.sessions.kernelsessionmanager.WebhookKernelSessionManager`.

### Enabling Authentication

Authentication can be enabled if the API requires it. Set the environment variable `EG_AUTH_TYPE` or configure `WebhookKernelSessionManager.auth_type` to either `Basic` or `Digest`. If it is set to an empty string, authentication is not enabled.

Then set the environment variables `EG_WEBHOOK_USERNAME` and `EG_WEBHOOK_PASSWORD` or configure `WebhookKernelSessionManager.webhook_username` and `WebhookKernelSessionManager.webhook_password` to provide the username and password for authentication.
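When verifying that the API accepts the configured credentials, it can help to issue a request with the same settings from a small script. The sketch below uses only the Python standard library and reads the environment variables named above; `build_auth_handler` is a hypothetical helper, the case-insensitive comparison is an assumption, and none of this reflects how Enterprise Gateway itself issues requests.

```python
import os
import urllib.request
from typing import Optional


def build_auth_handler(url: str) -> Optional[urllib.request.BaseHandler]:
    """Return a urllib auth handler matching EG_AUTH_TYPE, or None if unset."""
    auth_type = os.getenv("EG_AUTH_TYPE", "").lower()
    if not auth_type:
        return None  # An empty string means authentication is not enabled.

    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(
        None,  # Match any realm.
        url,
        os.getenv("EG_WEBHOOK_USERNAME", ""),
        os.getenv("EG_WEBHOOK_PASSWORD", ""),
    )
    if auth_type == "basic":
        return urllib.request.HTTPBasicAuthHandler(manager)
    if auth_type == "digest":
        return urllib.request.HTTPDigestAuthHandler(manager)
    raise ValueError(f"Unsupported EG_AUTH_TYPE: {auth_type!r}")
```

The returned handler can be passed to `urllib.request.build_opener` to issue test requests against the URL configured in `EG_WEBHOOK_URL`.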

## Bring Your Own Kernel Session Persistence

To introduce a different implementation, you must configure the kernel session manager class. Here's an example for starting Enterprise Gateway using a custom `KernelSessionManager` and 'standalone' availability. Note that setting `--MyCustomKernelSessionManager.enable_persistence=True` is not necessary because an availability mode is specified, but displayed here for completeness:

```bash
#!/bin/bash

LOG=/var/log/enterprise_gateway.log
PIDFILE=/var/run/enterprise_gateway.pid

jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --log-level=DEBUG \
  --EnterpriseGatewayApp.kernel_session_manager_class=custom.package.MyCustomKernelSessionManager \
  --MyCustomKernelSessionManager.enable_persistence=True \
  --EnterpriseGatewayApp.availability_mode=standalone > $LOG 2>&1 &

if [ "$?" -eq 0 ]; then
  echo $! > $PIDFILE
else
  exit 1
fi
```

Alternative persistence implementations using SQL and NoSQL databases would be ideal and, as always, contributions are welcome!

## Testing Kernel Session Persistence

Once kernel session persistence has been enabled and configured, create a kernel by opening a Jupyter Notebook. Assign a variable in that notebook, then shut down Enterprise Gateway using `kill -9 PID`, where `PID` is the process ID of the gateway. Restart Enterprise Gateway and refresh your notebook tab. If everything worked correctly, the variable should still be defined without needing to rerun the cell.

If you are using docker, ensure the container isn't tied to the PID of Enterprise Gateway. The container should still run after killing that PID.
9 changes: 7 additions & 2 deletions docs/source/operators/config-cli.md
Expand Up @@ -106,6 +106,11 @@ EnterpriseGatewayApp(EnterpriseGatewayConfigMixin, JupyterApp) options
will be raised on a failed match. This option requires TLS to be enabled.
It does not support IP addresses. (EG_AUTHORIZED_ORIGIN env var)
Default: ''
--EnterpriseGatewayApp.availability_mode=<CaselessStrEnum>
Specifies the type of availability. Values must be one of "standalone"
or "replication". (EG_AVAILABILITY_MODE env var)
Choices: any of ['standalone', 'replication'] (case-insensitive) or None
Default: None
--EnterpriseGatewayApp.base_url=<Unicode>
The base path for mounting all API resources (EG_BASE_URL env var)
Default: '/'
Expand Down Expand Up @@ -242,7 +247,7 @@ EnterpriseGatewayApp(EnterpriseGatewayConfigMixin, JupyterApp) options
Default: None
--EnterpriseGatewayApp.trust_xheaders=<CBool>
Use x-* header values for overriding the remote-ip, useful when application
is behing a proxy. (EG_TRUST_XHEADERS env var)
is behind a proxy. (EG_TRUST_XHEADERS env var)
Default: False
--EnterpriseGatewayApp.unauthorized_users=<set-item-1>...
Comma-separated list of user names (e.g., ['root','admin']) against which
Expand All @@ -252,7 +257,7 @@ EnterpriseGatewayApp(EnterpriseGatewayConfigMixin, JupyterApp) options
Default: {'root'}
--EnterpriseGatewayApp.ws_ping_interval=<Int>
Specifies the ping interval(in seconds) that should be used by zmq port
associated withspawned kernels.Set this variable to 0 to disable ping mechanism.
associated with spawned kernels.Set this variable to 0 to disable ping mechanism.
(EG_WS_PING_INTERVAL_SECS env var)
Default: 30
--EnterpriseGatewayApp.yarn_endpoint=<Unicode>
39 changes: 0 additions & 39 deletions docs/source/operators/config-kernel-persistence.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/operators/config-security.md
@@ -1,4 +1,4 @@
# Configuring Security
# Configuring security

Jupyter Enterprise Gateway does not currently perform user _authentication_ but, instead, assumes that all users
issuing requests have been previously authenticated. Recommended applications for this are
2 changes: 1 addition & 1 deletion docs/source/operators/deploy-kubernetes.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Deploying Enterprise Gateway on Kubernetes
# Kubernetes deployments

## Overview

2 changes: 1 addition & 1 deletion docs/source/operators/index.rst
Expand Up @@ -65,5 +65,5 @@ Jupyter Enterprise Gateway adheres to
config-kernel-override
config-dynamic
config-culling
config-kernel-persistence
config-availability
config-security
29 changes: 24 additions & 5 deletions enterprise_gateway/enterprisegatewayapp.py
Expand Up @@ -141,9 +141,28 @@ def init_configurables(self):
config=self.config, # required to get command-line options visible
)

# Attempt to start persisted sessions
# Commented as part of https://github.com/jupyter-server/enterprise_gateway/pull/737#issuecomment-567598751
# self.kernel_session_manager.start_sessions()
        # For B/C purposes, check if session persistence is enabled. If so, and availability
        # mode is not enabled, go ahead and default availability mode to 'replication'.
        if self.kernel_session_manager.enable_persistence:
            if self.availability_mode is None:
                self.availability_mode = EnterpriseGatewayConfigMixin.AVAILABILITY_REPLICATION
                self.log.info(
                    f"Kernel session persistence is enabled but availability mode is not. "
                    f"Setting EnterpriseGatewayApp.availability_mode to '{self.availability_mode}'."
                )
        else:
            # Persistence is not enabled, check if availability_mode is configured and, if so,
            # auto-enable persistence
            if self.availability_mode is not None:
                self.kernel_session_manager.enable_persistence = True
                self.log.info(
                    f"Availability mode is set to '{self.availability_mode}' yet kernel session "
                    "persistence is not enabled. Enabling kernel session persistence."
                )

        # If we're using standalone availability, attempt to start persisted sessions
        if self.availability_mode == EnterpriseGatewayConfigMixin.AVAILABILITY_STANDALONE:
            self.kernel_session_manager.start_sessions()

self.contents_manager = None # Gateways don't use contents manager

Expand Down Expand Up @@ -253,11 +272,11 @@ def _build_ssl_options(self) -> Optional[ssl.SSLContext]:
return ssl_context

def init_http_server(self):
"""Initializes a HTTP server for the Tornado web application on the
"""Initializes an HTTP server for the Tornado web application on the
configured interface and port.

Tries to find an open port if the one configured is not available using
the same logic as the Jupyer Notebook server.
the same logic as the Jupyter Notebook server.
"""
ssl_options = self._build_ssl_options()
self.http_server = httpserver.HTTPServer(
22 changes: 20 additions & 2 deletions enterprise_gateway/mixins.py
Expand Up @@ -13,6 +13,7 @@
from tornado.log import LogFormatter
from traitlets import (
Bool,
CaselessStrEnum,
CBool,
Instance,
Integer,
Expand Down Expand Up @@ -269,7 +270,7 @@ def expose_headers_default(self):
False,
config=True,
help="""Use x-* header values for overriding the remote-ip, useful when
application is behing a proxy. (EG_TRUST_XHEADERS env var)""",
application is behind a proxy. (EG_TRUST_XHEADERS env var)""",
)

@default("trust_xheaders")
Expand Down Expand Up @@ -633,7 +634,7 @@ def max_kernels_per_user_default(self):
ws_ping_interval_default_value,
config=True,
help="""Specifies the ping interval(in seconds) that should be used by zmq port
associated withspawned kernels.Set this variable to 0 to disable ping mechanism.
associated with spawned kernels. Set this variable to 0 to disable ping mechanism.
(EG_WS_PING_INTERVAL_SECS env var)""",
)

Expand Down Expand Up @@ -680,6 +681,23 @@ def dynamic_config_interval_changed(self, event):

dynamic_config_poller = None

    # Availability Mode
    AVAILABILITY_STANDALONE = "standalone"
    AVAILABILITY_REPLICATION = "replication"
    availability_mode_env = "EG_AVAILABILITY_MODE"
    availability_mode_default_value = None
    availability_mode = CaselessStrEnum(
        allow_none=True,
        values=[AVAILABILITY_REPLICATION, AVAILABILITY_STANDALONE],
        config=True,
        help="""Specifies the type of availability. Values must be one of "standalone" or "replication".
        (EG_AVAILABILITY_MODE env var)""",
    )

    @default("availability_mode")
    def availability_mode_env_default(self):
        return os.getenv(self.availability_mode_env, self.availability_mode_default_value)

kernel_spec_manager = Instance("jupyter_client.kernelspec.KernelSpecManager", allow_none=True)

kernel_spec_manager_class = Type(