Availability modes #1095

kevin-bates · 2022-05-25T00:14:54Z

#1086 references an issue whereby the loading of persisted kernel sessions at EG's startup was commented out when the changes for #737 were merged. PR #737 essentially made it possible to run multiple EG instances simultaneously, emulating active-active availability. The previous code, on the other hand, emulated more of an active-passive behavior, in which only a single EG instance is running but with a higher degree of resiliency, as pointed out in #1086. Some users have found that functionality helpful and we should try to accommodate that use case as well.

This pull request introduces a configurable option named availability_mode that can hold one of three values: None (default), active-active, and active-passive. Both non-None values require that kernel session persistence also be enabled. Since 'active-active' was essentially the default behavior (when kernel session persistence was enabled), we will automatically set the availability_mode to active-active whenever kernel session persistence is enabled and availability mode is not set, thereby providing a form of backward compatibility.
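
The defaulting rule described above can be sketched as a small pure function. This is only an illustration under stated assumptions, not the code in this PR: the helper name `effective_availability_mode` and its parameters are hypothetical, and raising an error for a missing persistence setting is just one way to express "require".

```python
from typing import Optional

# Mode strings as used in this description.
VALID_MODES = ("active-active", "active-passive")


def effective_availability_mode(
    configured_mode: Optional[str], persistence_enabled: bool
) -> Optional[str]:
    """Return the availability mode that should take effect (hypothetical helper)."""
    if configured_mode is not None and configured_mode not in VALID_MODES:
        raise ValueError(f"Unknown availability mode: {configured_mode!r}")
    if configured_mode is None:
        # Backward compatibility: persistence alone used to behave like active-active.
        return "active-active" if persistence_enabled else None
    if not persistence_enabled:
        # Both non-None modes need kernel session persistence.
        raise ValueError(f"{configured_mode!r} requires kernel session persistence")
    return configured_mode
```

With persistence enabled and no mode configured, the sketch returns "active-active"; with neither set, it returns None and no availability behavior applies.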

Users desiring a single-instanced EG that is capable of restarting following an unexpected failure can now use the availability mode of 'active-passive'.

These modes (including kernel session persistence) can be enabled via a configuration file, command line, or environment variables as noted in the documentation or when running jupyter enterprisegateway --help-all.

As noted in the companion documentation, this functionality should be considered experimental!

Resolves: #1086

enterprise_gateway/mixins.py

rahul26goyal · 2022-05-29T14:05:52Z

enterprise_gateway/enterprisegatewayapp.py

+
+            # If we're using active-passive availability, attempt to start persisted sessions
+            if self.availability_mode == "active-passive":
+                self.kernel_session_manager.start_sessions()


@kevin-bates : I went over the previous comments on this decision, but I still do not completely understand why we wouldn't want to load all the sessions from persistence at server start, irrespective of the availability_mode.

Yeah, I've had similar discussions with myself. 😄 (It's good to have someone else to talk with about this!)

I think the primary issue here is affinity: if we load the sessions in active-active, then every EG node will have a KernelManager thinking it is managing the kernel. In active-active, a "second" node will only manage a "previously managed" kernel when the "previously-managing node" has gone down, so there's still only one node managing the kernel (because we always require "node affinity").
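
To make the affinity argument concrete, here is a toy model that counts how many nodes believe they manage each kernel. It is a sketch only: the node names, kernel IDs, and the `managers_per_kernel` helper are hypothetical and involve no EG code.

```python
from typing import Dict, Set


def managers_per_kernel(nodes: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count, per kernel ID, how many nodes hold a manager for it."""
    counts: Dict[str, int] = {}
    for managed in nodes.values():
        for kernel_id in managed:
            counts[kernel_id] = counts.get(kernel_id, 0) + 1
    return counts


persisted = {"k1", "k2"}

# Eager loading at startup on every node: each node claims every persisted kernel.
eager = {"node-a": set(persisted), "node-b": set(persisted)}

# Lazy loading: node-a started both kernels; node-b holds nothing yet.
lazy_before = {"node-a": {"k1", "k2"}, "node-b": set()}

# node-a goes down and a request for k1 reaches node-b, which loads only k1.
lazy_after = {"node-b": {"k1"}}
```

In the eager case every kernel ends up with two managers, while in the lazy case each kernel has at most one manager at any point in time.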

Perhaps the terms active-active and active-passive to describe these modes are not quite correct. As @dnwe pointed out in the issue, they use active-passive as more of a form of resiliency than HA (I suppose more for DR). Perhaps we could spin active-active to HA and active-passive to DR?

Thoughts?

My thoughts on naming the availability modes:
"active-passive" -> single_instance
"active-active" -> multi_instance

I like these names as they essentially describe the expected configuration of each and don't attempt to overload or conflate the meanings of the classic HA/DR terms.

I would like to continue using hyphens as the separators in the string values. (I view underscores more for variable names and constants.) So let's go with "single-instance" (where "active-passive" is used) and "multi-instance" (where "active-active" is used). Does that sound okay?

I've updated the values to use the instance references. Note that I also added code to auto-enable kernel session persistence if it is not set when an availability mode is set. It felt a little overbearing to require the persistence setting explicitly when an availability mode already depends on it. So, rather than throw an exception, we'll log an informational message.
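
A minimal sketch of that auto-enable behavior, assuming a hypothetical helper `resolve_persistence` and logger name (neither is taken from the PR), might look like this:

```python
import logging
from typing import Optional

log = logging.getLogger("availability-sketch")  # hypothetical logger name


def resolve_persistence(availability_mode: Optional[str], persistence_enabled: bool) -> bool:
    """Return whether kernel session persistence should be on (hypothetical helper)."""
    if availability_mode is not None and not persistence_enabled:
        # Log and enable instead of raising, since the mode depends on persistence.
        log.info(
            "Kernel session persistence is required for availability mode '%s' "
            "and was not enabled; enabling it.",
            availability_mode,
        )
        return True
    return persistence_enabled
```

An explicit persistence setting is left untouched when no availability mode is configured.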

@kevin-bates : I was going over some other service documentation, where I found two other terms used to describe similar availability scenarios:

active-passive -> standalone

active-active -> replication

Hmm - I think I like these names over "single-instance" and "multi-instance", especially if there's precedent.

@lresende - you just approved this PR. Are you okay with going with the names "Standalone" and "Replication"?

rahul26goyal · 2022-05-29T14:18:39Z

docs/source/operators/config-availability.md

+Known issues include:
+1. Culling configurations do not account for different nodes and therefore could result in the premature culling of kernels.
+2. Each "node switch" requires a manual reconnect to the kernel.
+


Are the above issues only with "active-active" mode and not with "active-passive" mode of EG?

I think reconnecting is necessary for both forms.

Even with "active-active", because we still expect/advise affinity with the managed kernel, you shouldn't run into an issue where the kernel is culled prematurely because it should always stay on the originating node. Only if the affinity is not configured (or not working) could the kernel be culled prematurely from the previous node.

I'll look into some better wording for this, but we should probably better understand where things are with this before merging. Thanks for this comment.

rahul26goyal · 2022-06-08T04:47:20Z

enterprise_gateway/enterprisegatewayapp.py

+                )
+
+        # If we're using single-instance availability, attempt to start persisted sessions
+        if self.availability_mode == "single-instance":


Can we define constants / static variables for these availability_mode values so that they can be used across modules / files?
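
One way to do this is sketched below. The constant names and the `should_start_persisted_sessions` helper are hypothetical; the string values are the ones appearing in the diff at this point in the review.

```python
# Hypothetical shared constants for the availability mode values.
AVAILABILITY_SINGLE_INSTANCE = "single-instance"
AVAILABILITY_MULTI_INSTANCE = "multi-instance"
AVAILABILITY_MODES = (AVAILABILITY_SINGLE_INSTANCE, AVAILABILITY_MULTI_INSTANCE)


def should_start_persisted_sessions(availability_mode):
    """Mirror the startup check from the diff using the shared constant."""
    return availability_mode == AVAILABILITY_SINGLE_INSTANCE
```

Other modules can then import the constants rather than repeating string literals, so a later rename touches one place only.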

rahul26goyal · 2022-06-08T04:50:58Z

enterprise_gateway/services/kernels/remotemanager.py

@@ -162,7 +162,9 @@ def check_kernel_id(self, kernel_id):
                self.parent.kernel_session_manager.delete_session(kernel_id)
                raise web.HTTPError(404, "Kernel does not exist: %s" % kernel_id)

-    def _refresh_kernel(self, kernel_id):
+    def _refresh_kernel(self, kernel_id) -> bool:
+        if not self.parent.availability_mode or self.parent.availability_mode == "single-instance":


The thought here is that, in the case of single-instance mode, the kernels are already hydrated when the EG server starts, so there is no need to check persistence for the kernel?

Correct. The multi-kernel manager should be aware of all active kernels in this case.
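
The short-circuit discussed here can be illustrated with a stand-alone sketch. The `SessionStoreSketch` class and `refresh_kernel` function are hypothetical stand-ins, and returning False from the early exit is an assumption, since the diff excerpt does not show the body of the branch.

```python
from typing import Iterable, Optional


class SessionStoreSketch:
    """Hypothetical stand-in for a persisted-session store."""

    def __init__(self, persisted: Iterable[str]):
        self.persisted = set(persisted)
        self.lookups = 0

    def start_session(self, kernel_id: str) -> bool:
        self.lookups += 1
        return kernel_id in self.persisted


def refresh_kernel(
    availability_mode: Optional[str], store: SessionStoreSketch, kernel_id: str
) -> bool:
    # With no mode, or with single-instance (sessions already loaded at startup),
    # skip the persistence lookup. The False return value is an assumption.
    if not availability_mode or availability_mode == "single-instance":
        return False
    # Otherwise try to load the kernel's session from persistence.
    return store.start_session(kernel_id)
```

Only the multi-instance path touches the store, which matches the idea that a single-instance server already knows about all active kernels.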

rahul26goyal

LGTM. Approving it.

kevin-bates · 2022-06-10T13:44:07Z

I still need to apply the final name changes with "Standalone" and "Replication" so let's not merge yet.

kevin-bates · 2022-06-10T19:28:40Z

Need to rework the docs now that #1101 has been merged.


kevin-bates added enhancement performance & scalability labels May 25, 2022

kevin-bates self-assigned this May 25, 2022

kevin-bates added this to the v3.0 milestone May 25, 2022

kevin-bates mentioned this pull request May 25, 2022

Persisted sessions are not restored at startup, only if requested by kernel ID, leaking pods? #1086

Closed

kevin-bates requested review from lresende and rahul26goyal May 25, 2022 19:00

rahul26goyal reviewed May 29, 2022


kevin-bates mentioned this pull request Jun 1, 2022

Added new WebhookKernelSessionManager for Kernel Persistence #1101

Merged

rahul26goyal reviewed Jun 8, 2022


lresende approved these changes Jun 8, 2022


rahul26goyal approved these changes Jun 9, 2022


Zsailer mentioned this pull request Jun 9, 2022

Meeting Notes 2022 jupyter-server/team-compass#15

Closed

kevin-bates force-pushed the availibility-mode branch from 464f59d to 44d8b2c Compare June 13, 2022 21:26

kevin-bates and others added 7 commits June 13, 2022 14:35

Introduce availability modes

522ec9c

[pre-commit.ci] auto fixes from pre-commit.com hooks

9edb339

for more information, see https://pre-commit.ci

Address current review comments

72fb52a

Rename modes to single-instance and multi-instance

04cd7cf

Auto-enable kernel session persistence if availability mode is set

e4f1df9

Incorporate existing kernel persistence docs

835d293

Rename availability modes per review

cad1d85

kevin-bates force-pushed the availibility-mode branch from 44d8b2c to cad1d85 Compare June 13, 2022 21:36

apply renaming to cli options

3112899

kevin-bates merged commit e151870 into jupyter-server:main Jun 27, 2022

kevin-bates deleted the availibility-mode branch June 27, 2022 18:48

kevin-bates mentioned this pull request Nov 21, 2022

New configurable/overridable kernel ZMQ+Websocket connection API jupyter-server/jupyter_server#1047

Merged
