Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC][skip-ci] Prevent plugins from blocking Kibana startup #45796

Merged
merged 17 commits into from
Dec 18, 2019

Conversation

rudolf
Copy link
Contributor

@rudolf rudolf commented Sep 16, 2019

Summary

Prevent plugins from blocking Kibana startup, either through blocking lifecycle methods or blocking context providers from completing.

Edit history:

  • 2019.09.24 Add references to TC39 top-level await
  • 2019.10.02 Incorporate review comments
    • Incorporate @epixa's comments on the drawback of treating the snapshot of state collected during setup as static and valid for the lifetime of the plugin
    • Fleshed out 'Adoption strategy & 'How we teach this' sections
    • Added a list of New Platform plugins or shims which depend on blocking lifecycle functions.
  • 2019.10.29 Address review comments from @joshdover and @azasypkin
    • Fleshed out motivation, this RFC doesn't prevent errors, but isolates the
      impact on the rest of Kibana.
    • Added a footnote explaining that sync lifecycles can still block on sync
      for loops, so it's not a perfect guarantee (from @azasypkin).
    • Updated IContextProvider type signature in (2) to match latest master
    • Dynamically reloading configuration changes should be limited to a
      whitelist, risky changes like the Elasticsearch host should still require a
      complete restart. Added to (3) based on [RFC][skip-ci] Prevent plugins from blocking Kibana startup #45796 (comment)
    • Added Section 5, "Core should expose a status signal for Core services &
      plugins"
    • Added the drawback that incorrect, but valid config would not block Kibana,
      and might only be surfaced when the associated API/UI gets used (from
      @azasypkin)
  • 2019.11.29
    • Expanded example (4.2) to include doing internal async operations which the plugin's API depends on
    • Clarify that context providers won't block kibana, just the handlers that depend on that context
  • 2019.12.12 Updated adoption strategy to a more incremental rollout of introducing timeouts as per alternative (1) while deprecating async lifecycles and only removing support at a later stage.

Rendered RFC

Background

Started as a discussion in #18854

Fixes #45417

@rudolf rudolf changed the title [RFC] Prevent plugins from blocking Kibana startup [RFC][skip-ci] Prevent plugins from blocking Kibana startup Sep 17, 2019
@rudolf rudolf force-pushed the rfc-lifecycle-unblocked branch 3 times, most recently from 56b63f4 to 54018e8 Compare September 17, 2019 09:26
@elastic elastic deleted a comment from elasticmachine Sep 17, 2019
@elastic elastic deleted a comment from elasticmachine Sep 17, 2019
@rudolf rudolf added Feature:New Platform Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc labels Sep 17, 2019
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-platform

@spalger spalger added the test-matrix Use this label to ensure PRs are tested with matrix jobs label Sep 19, 2019
@rudolf rudolf marked this pull request as ready for review September 24, 2019 08:15
// (3.1.1) We can expose a convenient API by doing a lot of work
adminClient: () => {
callAsInternalUser: async (...args) => {
adminClient = await coreSetup.elasticsearch.adminClient$.pipe(take(1)).toPromise();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this proposal, every call of callAsInternalUser can have a different version of adminClient. I suspect it contradicts RFC handler design https://github.com/elastic/kibana/pull/36509/files#diff-8e0211fd03975bea94cb15762f6c7eaaR92

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These examples show what the implementation would look like if we implemented proposal (1) and (2) but not (3). If we do not implement (3), I agree that we would contradict the handler design.

If we implement (3) the context handler will look like example (4.3) which I don't think contradicts the handler design.

I didn't put this in the RFC itself, but I think the handler pattern tried to solve two problems:
A. Improve the ergonomics of consuming API's that are strongly tied to short-lived state like an incoming request. This means not having to call .asScoped(req) inside a request handler the whole time.
B. Improve the ergonomics of consuming Observable-based API's. This means not having to do await someApi$.pipe(first()).toPromise()

I believe proposal (3) solves (B) across all of Core's API's not just API's that are relevant in handlers. I think this is a better alternative to introducing context everywhere even if there isn't some kind of state to "fold" into the context. IMO the contexts should only be used to solve (A).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If context doesn't try to solve (B) because we solved it with (3) then functionality being "fixed" inside a context as described by the Context pattern becomes less important.

// async & await isn't necessary here, but makes example a bit clearer.
// Note that the elasticsearch client no longer exposes an adminClient$
// observable.
const result = await core.elasticsearch.adminClient.callAsInternalUser('ping', ...);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logging service and Elasticsearch services have a different strategy for handling config updates.
Elasticsearch service re-creates cluster client, so dependent plugins know for sure that something has changed and they should use a newer version of the API.
Logging service provides non-reactive API. it handles all config updates internally and dependent plugins know nothing about internal changes.
@azasypkin I believe you mentioned that different design decisions were done deliberately. Does the proposed approach eliminate all the benefits of having explicit elasticsearch client updates?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think of it, but the logging service is a great example of an API that hides the reactivity 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@azasypkin I believe you mentioned that different design decisions were done deliberately.

Yeah, it was done on purpose, logging is something that is supposed to be used all over the place and adding Rx to the mix could turn into nightmare for the developers. I actually wanted to have the same for ES client, but we didn't get quorum on that in the past.

# Alternatives
## 1. Introduce a lifecycle/context provider timeout
Lifecycle methods and context providers would timeout after X seconds and any
API's they expose would not be available if the timeout had been reached.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have clear boundaries for every stage. We could monitor execution time and log that a plugin initialization lasts too long (kinda similar to the next approach). After that, a user can decide to disable slow plugin or not.

# Drawbacks
Not being able to block on a lifecycle method also means plugins can no longer
be certain that all setup is complete before they reach the start lifecycle.
Plugins will have to manage such state internally. Core will still expose
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes a plugin needs to get async API result. Does it mean that core API has to switch to observables? For example, we have a case for the security plugin
https://github.com/restrry/kibana/blob/48d34abd99ee17fa610e8e5d3ab8dd05e5e4d1d3/src/core/server/http/http_server.ts#L314-L332

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way to implement this RFC for the createCookieSessionStorageFactory API could look something like this:

Core Setup API becomes synchronous:

export inter HttpServerSetup {
  createCookieSessionStorageFactory: <T>(
    cookieOptions: SessionStorageCookieOptions<T>
  ) => SessionStorageFactory<T>;

Using the SessionStorageFactory becomes async:

export interface SessionStorageFactory<T> {
  asScoped: (request: KibanaRequest) => Promise<SessionStorage<T>>;
}

So the "setup" API is completely synchronous, all registration happens synchronously, but using the API becomes async.

So this line in the security plugin will become

const sessionStorage = await this.options.sessionStorageFactory.asScoped(request);

Drawbacks:
1. Since plugins load serially, even if they don't block startup, all the
delays could add up and potentially make startup very slow.
2. This opens up the potential for a third-party plugin to effectively "break"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if 3rd party plugin stucks on init step, it's hard to diagnose the problem

setup(core, deps) {
  return {
    async getData() {
       return await deps.pluginA.search('data'); // Timeout error because pluginA is not ready or request timeout?
    }
  }
}

@rudolf rudolf requested review from epixa, a team and azasypkin September 24, 2019 20:43
@rudolf rudolf added release_note:skip Skip the PR/issue when compiling release notes v8.0.0 labels Sep 26, 2019
@kobelb
Copy link
Contributor

kobelb commented Sep 30, 2019

Currently, the security plugin effectively blocks startup until it's able to synchronize its privilege definition with Elasticsearch. If I'm following this proposal, this would no longer be possible. Instead, we should block individual operations which are dependent upon this, primarily all calls to Elasticsearch's _has_privileges API. In situations where we're unable to synchronize the privileges into Elasticsearch, we could retry a number of times; however, at some point we'd likely have to give-up this retry and fail the operation. In these situations, how should we surface the error to the user? Do we just log the error? We used to be able to use the status service for this purpose, but as far as I'm aware, this isn't possible in the NP.

A similar situation exists within the Spaces plugin, where we effectively block start-up until we're able to insert the default space. We'd have to adopt a similar approach here of blocking space operations until we're able to insert the default space.

Copy link
Contributor

@epixa epixa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this RFC jives with my thoughts on these lifecycle and contextual functions. Great job breaking this down.

If it isn't obvious from my comments, I think this RFC could benefit from a greater emphasis on two things:

  1. Plugin initialization began synchronous when plugins were introduced in v4.2. It's a relatively new concept that plugins can delay startup at all (they've always been able to prevent startup by throwing an exception). I think this is important because it establishes that we're not proposing that we unwind the fundamentals of how plugins have always worked. Of note: the original plugin system designs we had for the new platform had synchronous lifecycle functions.
  2. Asynchronous lifecycle functions paint a false picture about the nature of integrating with external services. All scenarios that wish to utilize async lifecycle functions ultimately involve an integration with an external service (almost universally Elasticsearch), but any APIs that are backed by such a service must be designed to handle unavailability and state changes in explicit ways anyway. We're not introducing an additional burden with this RFC, we're removing a feature that made it less obvious that this was necessary, which makes it easier to introduce bugs.

rfcs/text/0007_lifecycle_unblocked.md Outdated Show resolved Hide resolved
rfcs/text/0007_lifecycle_unblocked.md Outdated Show resolved Hide resolved
rfcs/text/0007_lifecycle_unblocked.md Outdated Show resolved Hide resolved
rfcs/text/0007_lifecycle_unblocked.md Outdated Show resolved Hide resolved
consumers have to handle the same error condition in two places.

## 2. Treat anything that blocks Kibana from starting up as a bug
Effectively do what we've always been doing. We can improve on the status quo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, legacy plugins cannot block Kibana from starting. They can change their status to red (the "recommended" approach today), or they can make Kibana exit entirely (which we've treated as a bug), but the ability for a plugin to delay the startup of Kibana until it deems it appropriate to continue is entirely new functionality in the new platform. I'm not sure the best way to express that here, but it seems like an important thing to clarify so people don't think we're talking about undoing an established behavior that Kibana has relied on since the beginning.

rfcs/text/0007_lifecycle_unblocked.md Outdated Show resolved Hide resolved
rfcs/text/0007_lifecycle_unblocked.md Outdated Show resolved Hide resolved
@epixa
Copy link
Contributor

epixa commented Sep 30, 2019

Currently, the security plugin effectively blocks startup until it's able to synchronize its privilege definition with Elasticsearch.

If I understand this correctly (and I hope I am otherwise it seems like we have a bug in the security plugin today), the security plugin is not blocking Kibana startup in order to synchronize its privilege definition. While the security plugin is doing its initialization, it sets it status to red. In the meantime, Kibana starts up successfully, which is to say all other plugins continue their initialization and Kibana binds to a port and starts serving HTTP traffic. Kibana the software starts. Kibana the product is made less available by nature of the status service, which is independent of setup - plugins can go into a red state long after setup completes.

Edit: I think plugins should build their APIs to be resilient (even in the sense of throwing known errors) to their external service state, but if this is too hard of a thing to accomplish alongside new platform migration, we certainly can build a status service into core that behaves similarly to the legacy status behavior. That wouldn't be especially difficult, and it would be no worse than we have today.

@kobelb
Copy link
Contributor

kobelb commented Sep 30, 2019

If I understand this correctly (and I hope I am otherwise it seems like we have a bug in the security plugin today), the security plugin is not blocking Kibana startup in order to synchronize its privilege definition. While the security plugin is doing its initialization, it sets it status to red. In the meantime, Kibana starts up successfully, which is to say all other plugins continue their initialization and Kibana binds to a port and starts serving HTTP traffic. Kibana the software starts. Kibana the product is made less available by nature of the status service, which is independent of setup - plugins can go into a red state long after setup completes.

You're correct. I was conflating startup and the status service being green. I incorrectly assumed that in the NP since we didn't have a status service, we should be doing similar logic within the start lifecycle event itself and I ignored the intricacies of status being able to go from red -> green -> red.

Edit: I think plugins should build their APIs to be resilient (even in the sense of throwing known errors) to their external service state, but if this is too hard of a thing to accomplish alongside new platform migration, we certainly can build a status service into core that behaves similarly to the legacy status behavior. That wouldn't be especially difficult, and it would be no worse than we have today.

For our usages, I don't think it's terrible hard for us to fail when external services aren't in the proper state and throw a descriptive error. My primary concern was around surfacing these errors to the appropriate end-users.

rfcs/text/0007_lifecycle_unblocked.md Outdated Show resolved Hide resolved
rfcs/text/0007_lifecycle_unblocked.md Outdated Show resolved Hide resolved
rfcs/text/0007_lifecycle_unblocked.md Outdated Show resolved Hide resolved
This does not mean we should remove all observables from Core's API's. When an
API consumer is interested in the *state changes itself* it absolutely makes
sense to expose this as an Observable. Good examples of this is exposing
plugin config as this is state that changes over time to which a plugin should
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that necessary? We can still consume async APIs from a synchronous method. We can even expose APIs from a plugin that depend on async APIs from Core. If we adopt this RFC, the only thing we can't do is delay Kibana from starting up due to an asynchronous operation.

Overall, I agree that this distinction is a good idea and that we should stop exposing Observables from Core when they are an implementation detail. However I'm not sure it's required to solve the problem this RFC aims to address. I can see why it would be more ergonomic if we make context providers synchronous, but as I mentioned above, I'm not sure that is necessary (or even desirable?).

Are there examples that demonstrate why removing these Observable APIs is necessary to make lifecycle methods synchronous? If not, I think that change should be a separate proposal.

rfcs/text/0007_lifecycle_unblocked.md Show resolved Hide resolved
rfcs/text/0007_lifecycle_unblocked.md Outdated Show resolved Hide resolved
This does not mean we should remove all observables from Core's API's. When an
API consumer is interested in the *state changes itself* it absolutely makes
sense to expose this as an Observable. Good examples of this is exposing
plugin config as this is state that changes over time to which a plugin should
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

console.log('about to subscribe');
configService.atPath('key').subscribe(value => {
  console.log('got value');
});
console.log('after subscribe');

For me this is an implementation details that should not be relied on. Code should never make postulate that a callback on observable subscription method will execute synchronously IMHO. Definitely needs to provide an sync accessor.

Comment on lines 316 to 318
Once this RFC has been merged we should educate teams on the implications to
allow existing New Platform plugins to remove any async lifecycle
behaviour they're relying on before we change this in core.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main concern for me is (please correct me if I'm wrong), that to migrate a NP plugin to sync lifecycle methods, it will require all it's dependencies to have migrated first (more specifically, all dependencies that expose async APIs consumed by the plugin), therefor creating a strong migration order ?

If that's the case, with the increasing number of NP plugins, isn't there a risk that the RFC implementation will basically never be able to be integrated in master?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Migrating to sync lifecycle methods only affects API's which required an async operation to resolve before the API was returned from setup(). This means that some API's might change, but if plugin A hasn't been migrated it wouldn't block an upstream plugin from migrating. It will however create more work later on. So even if Plugin B has finished migrating, once Plugin A changes it's API Plugin B will once again have to do some refactoring to adopt the new API.

// Dependency with unmigrated, async lifecycles
class PluginA {
  async setup(core: CoreSetup) {
    return {
      create: async () => {return await core.savedObjects.getInternalRepository.create(...)}
    }
  }
}

// Migrated plugin with sync lifecycles
class PluginB {
  setup(core: CoreSetup, deps) {
    // const wasCreateSuccessful = (await deps.pluginA.create()) != null;
    deps.pluginA.create().then(() => {
      ... // I can consume an async API in setup, and wait till it resolves before
      ... // doing further operations, I just can't block setup.
    });
    return {
      // Plugins can continue to expose API's that make further async calls
      // to dependencies
      proxiedCreate: async () => {return await deps.pluginA.create(...)} 
      /* wasCreateSuccessful it's now impossible to expose this API */ 
    }
  }
}

rfcs/text/0007_lifecycle_unblocked.md Show resolved Hide resolved
@rudolf
Copy link
Contributor Author

rudolf commented Dec 12, 2019

Brandon brought up another area this RFC impacts in #52553. The security plugin registers a RightNavControl user profile button during setup. This user information lives in ES and is therefore async. Rendering this button asynchronously causes an intermediate state where the button is empty and a flicker as the user profile is rendered.

With async lifecycles there would be two ways to prevent this:

  • Asynchronously load the user profile from ES and inject it into the HTML page (we would have to add a new core API for this)
  • Load it during the browser setup() lifecycle blocking all rendering until the user was fetched

Both solutions introduce a way for a single plugin to slow down or prevent the rendering of Kibana.

As @joshdover mentioned, one way to mitigate this is through the use of skeleton screens (visual placeholders)

@rudolf
Copy link
Contributor Author

rudolf commented Dec 12, 2019

Meeting minutes from 12 December 2019
Attendees: @joshdover @pgayvallet @restrry @streamich @lukeelmers

Validated assumptions

The core assumption behind this RFC is that Kibana needs to be resilient against plugin bugs or external conditions that could cause the entire application to be unavailable.

We agreed that as Kibana will be used in increasingly mission critical conditions we cannot afford to introduce the risk that a single plugin takes Kibana down.

Potential solutions

  1. Synchronous lifecycles
    1. CON: Core pushes some error handling responsibility to plugins.
    2. PRO: Degredation happens per API method instead of the whole plugin
    3. PRO: Encourages reactive plugin architecture / API’s
  2. Lifecycle timeout
    1. How do we handle successful setup, but failed start → Plugins need to assume that dependencies could be unavailable (valid for sync lifecycles too).
    2. PRO: More power and simplicity for plugin authors
    3. PRO: Less refactoring for Plugins authors
  3. Sync lifecycles but with deprecated async lifecycles with a timeout as fallback
    1. CON: Inconsistent approaches, not clear when to choose which or what is a valid use case to prefer the fallback -> Deprecation and documentation must clearly communicate that sync lifecycles are the only long term solution that will be supported.
    2. CON: Plugins might never migrate
    3. PRO:
      1. less immediate refactoring
      2. more incremental rollout
      3. gives us flexibility if we discover a use case which needs a change in Core API’s in order to support from a sync lifecycle

For the reasons given if 3(c) we decided to go ahead with option (3) with the eventual goal of dropping async lifecycles completely.

Unknowns

Do plugins ever need to read config values and pass these as parameters to Core API’s? If so we would have to expose synchronous config values to support sync lifecycles.

Adoption strategy

  1. Update RFC with latest adoption strategy, enter final comment period and merge.
  2. Adapt Core API’s to make sync lifecycles easier
  3. Update migration guide and other documentation examples
  4. Deprecate async lifecycles with a warning log
  5. Refactor existing plugin lifecycles which are easily converted to sync
  6. Future: remove async timeout lifecycles

Other: Create Plugin Status RFC

@rudolf
Copy link
Contributor Author

rudolf commented Dec 12, 2019

This RFC has now entered into the final comment period. Unless any fatal flaws are raised, it will be accepted and merged on 17 December.

@rudolf rudolf added the RFC/final-comment-period If no concerns are raised in 3 business days this RFC will be accepted label Dec 12, 2019
Copy link
Contributor

@joshdover joshdover left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rubber stamp approval based on our meeting today

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:New Platform release_note:skip Skip the PR/issue when compiling release notes RFC/final-comment-period If no concerns are raised in 3 business days this RFC will be accepted Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc test-matrix Use this label to ensure PRs are tested with matrix jobs v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[new-platform] Prevent plugins lifecycle methods from blocking kibana startup
10 participants