[RFC][skip-ci] Prevent plugins from blocking Kibana startup #45796

rudolf · 2019-09-16T16:18:32Z

Summary

Prevent plugins from blocking Kibana startup, either through blocking lifecycle methods or blocking context providers from completing.

Edit history:

2019.09.24 Add references to TC39 top-level await
2019.10.02 Incorporate review comments
- Incorporate @epixa's comments on the drawback of treating the snapshot of state collected during setup as static and valid for the lifetime of the plugin
- Fleshed out 'Adoption strategy & 'How we teach this' sections
- Added a list of New Platform plugins or shims which depend on blocking lifecycle functions.
2019.10.29 Address review comments from @joshdover and @azasypkin
- Fleshed out motivation, this RFC doesn't prevent errors, but isolates the
  impact on the rest of Kibana.
- Added a footnote explaining that sync lifecycles can still block on sync
  for loops, so it's not a perfect guarantee (from @azasypkin).
- Updated IContextProvider type signature in (2) to match latest master
- Dynamically reloading configuration changes should be limited to a
  whitelist, risky changes like the Elasticsearch host should still require a
  complete restart. Added to (3) based on [RFC][skip-ci] Prevent plugins from blocking Kibana startup #45796 (comment)
- Added Section 5, "Core should expose a status signal for Core services &
  plugins"
- Added the drawback that incorrect, but valid config would not block Kibana,
  and might only be surfaced when the associated API/UI gets used (from
  @azasypkin)
2019.11.29
- Expanded example (4.2) to include doing internal async operations which the plugin's API depends on
- Clarify that context providers won't block kibana, just the handlers that depend on that context
2019.12.12 Updated adoption strategy to a more incremental rollout of introducing timeouts as per alternative (1) while deprecating async lifecycles and only removing support at a later stage.

Rendered RFC

Background

Started as a discussion in #18854

Fixes #45417

elasticmachine · 2019-09-17T09:36:20Z

Pinging @elastic/kibana-platform

mshustov · 2019-09-24T13:24:05Z

rfcs/text/0006_lifecycle_unblocked.md

+      // (3.1.1) We can expose a convenient API by doing a lot of work
+      adminClient: () => {
+        callAsInternalUser: async (...args) => {
+          adminClient = await coreSetup.elasticsearch.adminClient$.pipe(take(1)).toPromise();


With this proposal, every call of callAsInternalUser can have a different version of adminClient. I suspect it contradicts RFC handler design https://github.com/elastic/kibana/pull/36509/files#diff-8e0211fd03975bea94cb15762f6c7eaaR92

These examples show what the implementation would look like if we implemented proposal (1) and (2) but not (3). If we do not implement (3), I agree that we would contradict the handler design.

If we implement (3) the context handler will look like example (4.3) which I don't think contradicts the handler design.

I didn't put this in the RFC itself, but I think the handler pattern tried to solve two problems:
A. Improve the ergonomics of consuming API's that are strongly tied to short-lived state like an incoming request. This means not having to call .asScoped(req) inside a request handler the whole time.
B. Improve the ergonomics of consuming Observable-based API's. This means not having to do await someApi$.pipe(first()).toPromise()

I believe proposal (3) solves (B) across all of Core's API's not just API's that are relevant in handlers. I think this is a better alternative to introducing context everywhere even if there isn't some kind of state to "fold" into the context. IMO the contexts should only be used to solve (A).

If context doesn't try to solve (B) because we solved it with (3) then functionality being "fixed" inside a context as described by the Context pattern becomes less important.

mshustov · 2019-09-24T13:39:58Z

rfcs/text/0006_lifecycle_unblocked.md

+        // async & await isn't necessary here, but makes example a bit clearer.
+        // Note that the elasticsearch client no longer exposes an adminClient$
+        // observable.
+        const result = await core.elasticsearch.adminClient.callAsInternalUser('ping', ...);


Logging service and Elasticsearch services have a different strategy for handling config updates.
Elasticsearch service re-creates cluster client, so dependent plugins know for sure that something has changed and they should use a newer version of the API.
Logging service provides non-reactive API. it handles all config updates internally and dependent plugins know nothing about internal changes.
@azasypkin I believe you mentioned that different design decisions were done deliberately. Does the proposed approach eliminate all the benefits of having explicit elasticsearch client updates?

I didn't think of it, but the logging service is a great example of an API that hides the reactivity 👍

@azasypkin I believe you mentioned that different design decisions were done deliberately.

Yeah, it was done on purpose, logging is something that is supposed to be used all over the place and adding Rx to the mix could turn into nightmare for the developers. I actually wanted to have the same for ES client, but we didn't get quorum on that in the past.

mshustov · 2019-09-24T14:05:09Z

rfcs/text/0006_lifecycle_unblocked.md

+# Alternatives
+## 1. Introduce a lifecycle/context provider timeout
+Lifecycle methods and context providers would timeout after X seconds and any
+API's they expose would not be available if the timeout had been reached.


We have clear boundaries for every stage. We could monitor execution time and log that a plugin initialization lasts too long (kinda similar to the next approach). After that, a user can decide to disable slow plugin or not.

mshustov · 2019-09-24T14:20:09Z

rfcs/text/0006_lifecycle_unblocked.md

+# Drawbacks
+Not being able to block on a lifecycle method also means plugins can no longer
+be certain that all setup is complete before they reach the start lifecycle.
+Plugins will have to manage such state internally. Core will still expose


Sometimes a plugin needs to get async API result. Does it mean that core API has to switch to observables? For example, we have a case for the security plugin
https://github.com/restrry/kibana/blob/48d34abd99ee17fa610e8e5d3ab8dd05e5e4d1d3/src/core/server/http/http_server.ts#L314-L332

One way to implement this RFC for the createCookieSessionStorageFactory API could look something like this:

Core Setup API becomes synchronous:

export inter HttpServerSetup { createCookieSessionStorageFactory: <T>( cookieOptions: SessionStorageCookieOptions<T> ) => SessionStorageFactory<T>;

Using the SessionStorageFactory becomes async:

export interface SessionStorageFactory<T> { asScoped: (request: KibanaRequest) => Promise<SessionStorage<T>>; }

So the "setup" API is completely synchronous, all registration happens synchronously, but using the API becomes async.

So this line in the security plugin will become

const sessionStorage = await this.options.sessionStorageFactory.asScoped(request);

mshustov · 2019-09-24T14:36:06Z

rfcs/text/0006_lifecycle_unblocked.md

+Drawbacks:
+1. Since plugins load serially, even if they don't block startup, all the
+   delays could add up and potentially make startup very slow.
+2. This opens up the potential for a third-party plugin to effectively "break"


if 3rd party plugin stucks on init step, it's hard to diagnose the problem

setup(core, deps) { return { async getData() { return await deps.pluginA.search('data'); // Timeout error because pluginA is not ready or request timeout? } } }

kobelb · 2019-09-30T17:33:25Z

Currently, the security plugin effectively blocks startup until it's able to synchronize its privilege definition with Elasticsearch. If I'm following this proposal, this would no longer be possible. Instead, we should block individual operations which are dependent upon this, primarily all calls to Elasticsearch's _has_privileges API. In situations where we're unable to synchronize the privileges into Elasticsearch, we could retry a number of times; however, at some point we'd likely have to give-up this retry and fail the operation. In these situations, how should we surface the error to the user? Do we just log the error? We used to be able to use the status service for this purpose, but as far as I'm aware, this isn't possible in the NP.

A similar situation exists within the Spaces plugin, where we effectively block start-up until we're able to insert the default space. We'd have to adopt a similar approach here of blocking space operations until we're able to insert the default space.

epixa

Overall, this RFC jives with my thoughts on these lifecycle and contextual functions. Great job breaking this down.

If it isn't obvious from my comments, I think this RFC could benefit from a greater emphasis on two things:

Plugin initialization began synchronous when plugins were introduced in v4.2. It's a relatively new concept that plugins can delay startup at all (they've always been able to prevent startup by throwing an exception). I think this is important because it establishes that we're not proposing that we unwind the fundamentals of how plugins have always worked. Of note: the original plugin system designs we had for the new platform had synchronous lifecycle functions.
Asynchronous lifecycle functions paint a false picture about the nature of integrating with external services. All scenarios that wish to utilize async lifecycle functions ultimately involve an integration with an external service (almost universally Elasticsearch), but any APIs that are backed by such a service must be designed to handle unavailability and state changes in explicit ways anyway. We're not introducing an additional burden with this RFC, we're removing a feature that made it less obvious that this was necessary, which makes it easier to introduce bugs.

rfcs/text/0007_lifecycle_unblocked.md

epixa · 2019-09-30T20:03:44Z

rfcs/text/0007_lifecycle_unblocked.md

+   consumers have to handle the same error condition in two places.
+
+## 2. Treat anything that blocks Kibana from starting up as a bug
+Effectively do what we've always been doing. We can improve on the status quo


To be clear, legacy plugins cannot block Kibana from starting. They can change their status to red (the "recommended" approach today), or they can make Kibana exit entirely (which we've treated as a bug), but the ability for a plugin to delay the startup of Kibana until it deems it appropriate to continue is entirely new functionality in the new platform. I'm not sure the best way to express that here, but it seems like an important thing to clarify so people don't think we're talking about undoing an established behavior that Kibana has relied on since the beginning.

rfcs/text/0007_lifecycle_unblocked.md

epixa · 2019-09-30T20:21:31Z

Currently, the security plugin effectively blocks startup until it's able to synchronize its privilege definition with Elasticsearch.

If I understand this correctly (and I hope I am otherwise it seems like we have a bug in the security plugin today), the security plugin is not blocking Kibana startup in order to synchronize its privilege definition. While the security plugin is doing its initialization, it sets it status to red. In the meantime, Kibana starts up successfully, which is to say all other plugins continue their initialization and Kibana binds to a port and starts serving HTTP traffic. Kibana the software starts. Kibana the product is made less available by nature of the status service, which is independent of setup - plugins can go into a red state long after setup completes.

Edit: I think plugins should build their APIs to be resilient (even in the sense of throwing known errors) to their external service state, but if this is too hard of a thing to accomplish alongside new platform migration, we certainly can build a status service into core that behaves similarly to the legacy status behavior. That wouldn't be especially difficult, and it would be no worse than we have today.

kobelb · 2019-09-30T22:15:55Z

If I understand this correctly (and I hope I am otherwise it seems like we have a bug in the security plugin today), the security plugin is not blocking Kibana startup in order to synchronize its privilege definition. While the security plugin is doing its initialization, it sets it status to red. In the meantime, Kibana starts up successfully, which is to say all other plugins continue their initialization and Kibana binds to a port and starts serving HTTP traffic. Kibana the software starts. Kibana the product is made less available by nature of the status service, which is independent of setup - plugins can go into a red state long after setup completes.

You're correct. I was conflating startup and the status service being green. I incorrectly assumed that in the NP since we didn't have a status service, we should be doing similar logic within the start lifecycle event itself and I ignored the intricacies of status being able to go from red -> green -> red.

Edit: I think plugins should build their APIs to be resilient (even in the sense of throwing known errors) to their external service state, but if this is too hard of a thing to accomplish alongside new platform migration, we certainly can build a status service into core that behaves similarly to the legacy status behavior. That wouldn't be especially difficult, and it would be no worse than we have today.

For our usages, I don't think it's terrible hard for us to fail when external services aren't in the proper state and throw a descriptive error. My primary concern was around surfacing these errors to the appropriate end-users.

Co-Authored-By: Court Ewing <[email protected]>

rfcs/text/0007_lifecycle_unblocked.md

joshdover · 2019-10-30T20:59:25Z

rfcs/text/0007_lifecycle_unblocked.md

+This does not mean we should remove all observables from Core's API's. When an
+API consumer is interested in the *state changes itself* it absolutely makes
+sense to expose this as an Observable. Good examples of this is exposing
+plugin config as this is state that changes over time to which a plugin should


Is that necessary? We can still consume async APIs from a synchronous method. We can even expose APIs from a plugin that depend on async APIs from Core. If we adopt this RFC, the only thing we can't do is delay Kibana from starting up due to an asynchronous operation.

Overall, I agree that this distinction is a good idea and that we should stop exposing Observables from Core when they are an implementation detail. However I'm not sure it's required to solve the problem this RFC aims to address. I can see why it would be more ergonomic if we make context providers synchronous, but as I mentioned above, I'm not sure that is necessary (or even desirable?).

Are there examples that demonstrate why removing these Observable APIs is necessary to make lifecycle methods synchronous? If not, I think that change should be a separate proposal.

rfcs/text/0007_lifecycle_unblocked.md

Co-Authored-By: Josh Dover <[email protected]>

pgayvallet · 2019-11-01T18:52:34Z

rfcs/text/0007_lifecycle_unblocked.md

+This does not mean we should remove all observables from Core's API's. When an
+API consumer is interested in the *state changes itself* it absolutely makes
+sense to expose this as an Observable. Good examples of this is exposing
+plugin config as this is state that changes over time to which a plugin should


console.log('about to subscribe'); configService.atPath('key').subscribe(value => { console.log('got value'); }); console.log('after subscribe');

For me this is an implementation details that should not be relied on. Code should never make postulate that a callback on observable subscription method will execute synchronously IMHO. Definitely needs to provide an sync accessor.

pgayvallet · 2019-11-01T19:32:55Z

rfcs/text/0007_lifecycle_unblocked.md

+Once this RFC has been merged we should educate teams on the implications to
+allow existing New Platform plugins to remove any async lifecycle
+behaviour they're relying on before we change this in core.


Main concern for me is (please correct me if I'm wrong), that to migrate a NP plugin to sync lifecycle methods, it will require all it's dependencies to have migrated first (more specifically, all dependencies that expose async APIs consumed by the plugin), therefor creating a strong migration order ?

If that's the case, with the increasing number of NP plugins, isn't there a risk that the RFC implementation will basically never be able to be integrated in master?

Migrating to sync lifecycle methods only affects API's which required an async operation to resolve before the API was returned from setup(). This means that some API's might change, but if plugin A hasn't been migrated it wouldn't block an upstream plugin from migrating. It will however create more work later on. So even if Plugin B has finished migrating, once Plugin A changes it's API Plugin B will once again have to do some refactoring to adopt the new API.

// Dependency with unmigrated, async lifecycles class PluginA { async setup(core: CoreSetup) { return { create: async () => {return await core.savedObjects.getInternalRepository.create(...)} } } } // Migrated plugin with sync lifecycles class PluginB { setup(core: CoreSetup, deps) { // const wasCreateSuccessful = (await deps.pluginA.create()) != null; deps.pluginA.create().then(() => { ... // I can consume an async API in setup, and wait till it resolves before ... // doing further operations, I just can't block setup. }); return { // Plugins can continue to expose API's that make further async calls // to dependencies proxiedCreate: async () => {return await deps.pluginA.create(...)} /* wasCreateSuccessful it's now impossible to expose this API */ } } }

rfcs/text/0007_lifecycle_unblocked.md

Co-Authored-By: Josh Dover <[email protected]>

rudolf · 2019-12-12T12:46:10Z

Brandon brought up another area this RFC impacts in #52553. The security plugin registers a RightNavControl user profile button during setup. This user information lives in ES and is therefore async. Rendering this button asynchronously causes an intermediate state where the button is empty and a flicker as the user profile is rendered.

With async lifecycles there would be two ways to prevent this:

Asynchronously load the user profile from ES and inject it into the HTML page (we would have to add a new core API for this)
Load it during the browser setup() lifecycle blocking all rendering until the user was fetched

Both solutions introduce a way for a single plugin to slow down or prevent the rendering of Kibana.

As @joshdover mentioned, one way to mitigate this is through the use of skeleton screens (visual placeholders)

rudolf · 2019-12-12T19:16:23Z

Meeting minutes from 12 December 2019
Attendees: @joshdover @pgayvallet @restrry @streamich @lukeelmers

Validated assumptions

The core assumption behind this RFC is that Kibana needs to be resilient against plugin bugs or external conditions that could cause the entire application to be unavailable.

We agreed that as Kibana will be used in increasingly mission critical conditions we cannot afford to introduce the risk that a single plugin takes Kibana down.

Potential solutions

Synchronous lifecycles
1. CON: Core pushes some error handling responsibility to plugins.
2. PRO: Degredation happens per API method instead of the whole plugin
3. PRO: Encourages reactive plugin architecture / API’s
Lifecycle timeout
1. How do we handle successful setup, but failed start → Plugins need to assume that dependencies could be unavailable (valid for sync lifecycles too).
2. PRO: More power and simplicity for plugin authors
3. PRO: Less refactoring for Plugins authors
Sync lifecycles but with deprecated async lifecycles with a timeout as fallback
1. CON: Inconsistent approaches, not clear when to choose which or what is a valid use case to prefer the fallback -> Deprecation and documentation must clearly communicate that sync lifecycles are the only long term solution that will be supported.
2. CON: Plugins might never migrate
3. PRO:
  1. less immediate refactoring
  2. more incremental rollout
  3. gives us flexibility if we discover a use case which needs a change in Core API’s in order to support from a sync lifecycle

For the reasons given if 3(c) we decided to go ahead with option (3) with the eventual goal of dropping async lifecycles completely.

Unknowns

Do plugins ever need to read config values and pass these as parameters to Core API’s? If so we would have to expose synchronous config values to support sync lifecycles.

Adoption strategy

Update RFC with latest adoption strategy, enter final comment period and merge.
Adapt Core API’s to make sync lifecycles easier
Update migration guide and other documentation examples
Deprecate async lifecycles with a warning log
Refactor existing plugin lifecycles which are easily converted to sync
Future: remove async timeout lifecycles

Other: Create Plugin Status RFC

rudolf · 2019-12-12T20:06:00Z

This RFC has now entered into the final comment period. Unless any fatal flaws are raised, it will be accepted and merged on 17 December.

joshdover

Rubber stamp approval based on our meeting today

Draft RFC for unblocking plugin lifecycle methods

5d7d0c9

rudolf force-pushed the rfc-lifecycle-unblocked branch from 07ab8d4 to ec5de3a Compare September 17, 2019 08:48

rudolf changed the title ~~[RFC] Prevent plugins from blocking Kibana startup~~ [RFC][skip-ci] Prevent plugins from blocking Kibana startup Sep 17, 2019

rudolf force-pushed the rfc-lifecycle-unblocked branch 3 times, most recently from 56b63f4 to 54018e8 Compare September 17, 2019 09:26

Draft RFC for unblocking kibana startup

ee1ec45

rudolf force-pushed the rfc-lifecycle-unblocked branch from 54018e8 to ee1ec45 Compare September 17, 2019 09:35

elastic deleted a comment from elasticmachine Sep 17, 2019

rudolf added Feature:New Platform Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc labels Sep 17, 2019

spalger added the test-matrix Use this label to ensure PRs are tested with matrix jobs label Sep 19, 2019

rudolf marked this pull request as ready for review September 24, 2019 08:15

mshustov reviewed Sep 24, 2019

View reviewed changes

rudolf added 2 commits September 24, 2019 22:08

Rename rfc from 0006 to 0007

a125d8d

Add references to TC39 top-level await

e91b83c

rudolf requested review from epixa, a team and azasypkin September 24, 2019 20:43

rudolf added release_note:skip Skip the PR/issue when compiling release notes v8.0.0 labels Sep 26, 2019

epixa reviewed Sep 30, 2019

View reviewed changes

rudolf and others added 2 commits October 1, 2019 15:30

Update with review suggestion

aa1a788

Co-Authored-By: Court Ewing <[email protected]>

Update RFC#0007

977a43b

rudolf requested review from epixa, azasypkin, mshustov and joshdover October 30, 2019 09:39

joshdover reviewed Oct 30, 2019

View reviewed changes

Apply suggestions from code review

e606324

Co-Authored-By: Josh Dover <[email protected]>

kobelb mentioned this pull request Oct 31, 2019

Migrate authorization subsystem to the new platform. #46145

Merged

pgayvallet reviewed Nov 1, 2019

View reviewed changes

joshdover mentioned this pull request Nov 11, 2019

[APM] New platform migration: replace injected vars #44232

Closed

epixa removed their request for review November 19, 2019 15:32

dgieselaar mentioned this pull request Nov 28, 2019

[APM] Address fallout from NP server migration #51725

Merged

rudolf and others added 4 commits November 29, 2019 13:05

Update rfcs/text/0007_lifecycle_unblocked.md

332182a

Co-Authored-By: Josh Dover <[email protected]>

Merge branch 'master' into rfc-lifecycle-unblocked

935ae76

Example of plugin exposing API dependent on internal async operation

948f705

Clarify that context providers won't block kibana, just their handlers

e934829

kobelb mentioned this pull request Dec 9, 2019

Non-global injected initial data #52553

Open

rudolf added 2 commits December 12, 2019 20:51

Update adoption strategy as per latest discussion

6d3ddd4

Fix formatting

6042a01

rudolf added the RFC/final-comment-period If no concerns are raised in 3 business days this RFC will be accepted label Dec 12, 2019

joshdover approved these changes Dec 12, 2019

View reviewed changes

streamich approved these changes Dec 12, 2019

View reviewed changes

pgayvallet approved these changes Dec 12, 2019

View reviewed changes

rudolf merged commit 42d868d into elastic:master Dec 18, 2019

rudolf deleted the rfc-lifecycle-unblocked branch December 18, 2019 13:52

mshustov mentioned this pull request Dec 27, 2019

[Core] Don't expose Elasticsearch client as Observable #53267

Closed

This was referenced Jan 27, 2020

[Licensing] Enabling and disabling functionality based on current license in public #55985

Closed

[Watcher] More robust handling of license at setup #55831

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC][skip-ci] Prevent plugins from blocking Kibana startup #45796

[RFC][skip-ci] Prevent plugins from blocking Kibana startup #45796

rudolf commented Sep 16, 2019 •

edited

Loading

elasticmachine commented Sep 17, 2019

mshustov Sep 24, 2019

rudolf Sep 26, 2019

rudolf Sep 26, 2019

mshustov Sep 24, 2019

rudolf Sep 25, 2019

azasypkin Oct 8, 2019

mshustov Sep 24, 2019

mshustov Sep 24, 2019

rudolf Sep 24, 2019

mshustov Sep 24, 2019

kobelb commented Sep 30, 2019 •

edited

Loading

epixa left a comment

epixa Sep 30, 2019

epixa commented Sep 30, 2019 •

edited

Loading

kobelb commented Sep 30, 2019

joshdover Oct 30, 2019

pgayvallet Nov 1, 2019

pgayvallet Nov 1, 2019

rudolf Nov 29, 2019

rudolf commented Dec 12, 2019

rudolf commented Dec 12, 2019

rudolf commented Dec 12, 2019

joshdover left a comment

[RFC][skip-ci] Prevent plugins from blocking Kibana startup #45796

[RFC][skip-ci] Prevent plugins from blocking Kibana startup #45796

Conversation

rudolf commented Sep 16, 2019 • edited Loading

Summary

Edit history:

Background

elasticmachine commented Sep 17, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kobelb commented Sep 30, 2019 • edited Loading

epixa left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

epixa commented Sep 30, 2019 • edited Loading

kobelb commented Sep 30, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rudolf commented Dec 12, 2019

rudolf commented Dec 12, 2019

Validated assumptions

Potential solutions

Unknowns

Adoption strategy

rudolf commented Dec 12, 2019

joshdover left a comment

Choose a reason for hiding this comment

rudolf commented Sep 16, 2019 •

edited

Loading

kobelb commented Sep 30, 2019 •

edited

Loading

epixa commented Sep 30, 2019 •

edited

Loading