[New-Platform][Discuss] Elasticsearch connection availability #43456

Closed · rudolf opened this issue Aug 16, 2019 · 10 comments
Labels
blocker, discuss, Feature:New Platform, Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc.)

Comments

rudolf (Contributor) commented Aug 16, 2019

Updated based on discussions: 19 November 2019

The legacy elasticsearch plugin included a health check which periodically checked whether Kibana could connect to Elasticsearch and whether the Elasticsearch version was compatible with Kibana.

If any of these checks failed, the elasticsearch plugin would go into a "red" state, signaling to system administrators that Kibana is degraded and that intervention is required.

When there are ES connection or version mismatch problems, Core should (a rough sketch of one approach follows the list):

  1. Expose errors to plugins directly (i.e. plugins are expected to gracefully handle ES connection errors).
  2. Use the status service to signal a failure in Core's elasticsearch subservice (blocked by "Migrate status service and status page to New Platform" #41983). This will:
    • inform end-users using the Kibana UI that Kibana is in a degraded state and that they need to contact their "system administrators".
    • inform system administrators through an error log on the Kibana server.
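
For concreteness, here is a minimal sketch (TypeScript, using the @elastic/elasticsearch client and rxjs) of a periodic health check that Core could feed into a status service and expose to plugins. All names here (createEsAvailability$, HEALTHCHECK_INTERVAL_MS, the 'available'/'unavailable' states) are illustrative assumptions, not the real Kibana API.

```ts
import { Client } from '@elastic/elasticsearch';
import { defer, of, timer, Observable } from 'rxjs';
import { catchError, distinctUntilChanged, map, switchMap } from 'rxjs/operators';

type EsAvailability = 'available' | 'unavailable';

const HEALTHCHECK_INTERVAL_MS = 2500;

export function createEsAvailability$(client: Client): Observable<EsAvailability> {
  return timer(0, HEALTHCHECK_INTERVAL_MS).pipe(
    switchMap(() =>
      // ping() resolves when the cluster answers; any error maps to "unavailable"
      defer(() => client.ping()).pipe(
        map((): EsAvailability => 'available'),
        catchError(() => of<EsAvailability>('unavailable'))
      )
    ),
    // only emit when the state changes, so consumers see transitions rather than every tick
    distinctUntilChanged()
  );
}

// Usage: log transitions so they show up in the Kibana server log.
const client = new Client({ node: 'http://localhost:9200' });
createEsAvailability$(client).subscribe((state) => {
  console.log(`elasticsearch service is ${state}`);
});
```
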
rudolf added the Feature:New Platform and Team:Core labels on Aug 16, 2019
elasticmachine (Contributor) commented:

Pinging @elastic/kibana-platform

rudolf (Contributor, Author) commented Aug 16, 2019

In the event of (sustained) ES connectivity problems, these are the only courses of action that I can think of that either Core or a plugin author could take:

  1. Inform end-users using the Kibana UI that they need to contact their "system administrators".
  2. Inform system administrators through an error log on the Kibana server.
  3. Don't accept any new requests, by blocking the UI and responding with 500 to API requests (see the sketch after this list).
  4. Recover with the least amount of disruption to users once connectivity is established again.
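
Item 3 could look something like the following sketch. This is not Kibana's actual http service; it is a generic express-style middleware using a hypothetical esAvailability$ subject (like the health check sketched earlier), and it returns a 503 rather than a 500, which a later comment in this thread also suggests.

```ts
import express from 'express';
import { BehaviorSubject } from 'rxjs';

// hypothetical shared availability state, fed by a health check like the one above
const esAvailability$ = new BehaviorSubject<'available' | 'unavailable'>('unavailable');

const app = express();

// reject API requests while Elasticsearch is unavailable instead of letting
// every route handler fail in its own way
app.use((_req, res, next) => {
  if (esAvailability$.getValue() === 'unavailable') {
    res
      .status(503)
      .set('Retry-After', '30')
      .json({ message: 'Elasticsearch is unavailable, please contact your system administrator' });
    return;
  }
  next();
});

app.get('/api/example', (_req, res) => {
  res.json({ ok: true });
});

app.listen(5601);
```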

Are there instances where a plugin could reasonably continue working or otherwise gracefully handle such a failure?

Unless we can come up with good examples of where plugins would benefit from having control over how they react to connectivity problems, it feels like we should go with option (2) and treat this as a core concern.

rudolf (Contributor, Author) commented Aug 28, 2019

Related to #14163

joshdover (Contributor) commented:

I disagree that we should buffer all ES requests. I think individual plugins will need different behaviors and we shouldn't try to anticipate what should happen.

For example, the Task Manager plugin coordinates jobs across Kibana instances. We could potentially run a job more than once if we started buffering ES requests on a node that got disconnected while another healthy node picks up the job.

I think having a global UI, server error, etc. makes sense, but I also think that any plugin that is doing some background work with ES should itself handle the possibility that ES becomes unavailable.

Maybe with the TaskService (#18854) this behavior would change by stopping background tasks from running. But as long as plugins have direct access to Elasticsearch clients in their lifecycle methods, I don't think we should make assumptions about the requests that plugins are making in the background.
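
To make the "plugins handle it themselves" point concrete, here is a sketch of a retry helper a plugin could wrap around its own background ES calls, so the plugin (not core) decides whether and how to retry. claimNextTask and the .example-tasks index are hypothetical stand-ins, not real Task Manager APIs.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function retryOnEsUnavailable<T>(
  operation: () => Promise<T>,
  maxAttempts = 5
): Promise<T> {
  let delayMs = 1000;
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= maxAttempts) throw error;
      // back off before retrying so we don't hammer a struggling cluster
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      delayMs *= 2;
    }
  }
}

// Hypothetical background job: the plugin decides how (and whether) to retry,
// rather than core transparently buffering the request.
async function claimNextTask() {
  return retryOnEsUnavailable(() =>
    client.search({ index: '.example-tasks', size: 1 })
  );
}
```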

legrego (Member) commented Sep 24, 2019

I'm working on migrating the spaces server code to a NP plugin (#46181), which led me to this topic.

On startup, the Spaces plugin needs to create the Default space if it doesn't already exist. Before attempting this operation, the plugin waits for the following:

  1. xpack_main plugin goes green
  2. License information is available

Condition 1 is only met when the legacy ES plugin goes green (valid connection, compatible ES version, etc.):

mirrorPluginStatus(server.plugins.elasticsearch, this, 'yellow', 'red');

The Spaces plugin will not "go green" until the Default space has been properly initialized. I'm trying to replicate this behavior in the NP version, but we don't have the notion of ES availability at this point. I could infer that ES is available if I get a valid license back from the Licensing plugin, but as far as I can tell that doesn't perform a version check. It's also just a point-in-time connection check -- if the connection goes down before the default space is created, the Spaces plugin needs to know when to retry its operation before completing its setup (or maybe start) sequence.
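
A sketch of the kind of retry-until-available behavior the Spaces plugin needs here, assuming a 7.x-style @elastic/elasticsearch client. createDefaultSpaceIfMissing, the raw write to .kibana, and the 10-second retry interval are all illustrative; the real plugin would create the default space through the saved objects client rather than the raw ES client.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function createDefaultSpaceIfMissing(): Promise<void> {
  try {
    // create() fails with a 409 if the document already exists, which is fine
    await client.create({
      index: '.kibana',
      id: 'space:default',
      body: { type: 'space', space: { name: 'Default', _reserved: true } },
    });
  } catch (error: any) {
    if (error?.statusCode !== 409) throw error;
  }
}

export async function setupSpaces(): Promise<void> {
  // keep retrying until ES is reachable and the write succeeds, so the plugin
  // can still "go green" even if the connection was down during startup
  for (;;) {
    try {
      await client.ping();
      await createDefaultSpaceIfMissing();
      return;
    } catch {
      await new Promise((resolve) => setTimeout(resolve, 10_000));
    }
  }
}
```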

legrego (Member) commented Sep 27, 2019

> I think having a global UI, server error, etc. makes sense, but I also think that any plugin that is doing some background work with ES should itself handle the possibility that ES becomes unavailable.

++ I agree a global UI makes sense, and plugins certainly need to handle error conditions when trying to communicate with ES. I think it'd be helpful for core to be able to inform plugins whenever ES becomes available. We previously had this via the LP healthcheck, and it looks like a number of plugins are currently relying on this behavior (via await waitUntilReady()).

I also think core should prevent plugins from making requests to ES if it knows that Kibana is connected to an unsupported version of ES.
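
A sketch of what "prevent plugins from making requests when the ES version is unsupported" could look like: a guard that checks the cluster version before running an operation. The compatibility rule here (same major, ES minor >= Kibana minor) is a simplification of the real support matrix, and KIBANA_VERSION and withVersionGuard are illustrative names.

```ts
import { Client } from '@elastic/elasticsearch';

const KIBANA_VERSION = '7.6.0'; // placeholder for the running Kibana version

function isCompatible(esVersion: string, kibanaVersion: string): boolean {
  const [esMajor, esMinor] = esVersion.split('.').map(Number);
  const [kbMajor, kbMinor] = kibanaVersion.split('.').map(Number);
  return esMajor === kbMajor && esMinor >= kbMinor;
}

export async function withVersionGuard<T>(
  client: Client,
  operation: () => Promise<T>
): Promise<T> {
  const info = await client.info();
  // v7 clients wrap the result in { body }, v8 clients return it directly
  const version: string = (info as any).body?.version?.number ?? (info as any).version.number;
  if (!isCompatible(version, KIBANA_VERSION)) {
    throw new Error(`Incompatible Elasticsearch version ${version} (Kibana ${KIBANA_VERSION})`);
  }
  return operation();
}
```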

rudolf (Contributor, Author) commented Apr 3, 2020

We've had some discussion around ES node availability, but ES node version mismatches could be more serious since they could lead to data loss.

When Kibana starts up we don't want to start saved object migrations, because if an old node writes to the alias after the newer node has started the migration, those writes will go into the old index and might not be read by the new node performing the migration. So it makes sense to block startup when ES node versions don't match, because it could cause data loss.

What's the impact of running against mismatching ES nodes when Kibana is already started up? How should we respond? There's a risk that we're sending requests to an incompatible ES node and the request fails in unpredictable ways. This could be benign like a dashboard no longer loading or more serious like a dashboard showing incorrect results. It could also potentially result in task manager endlessly pulling tasks from its queue and running them, but never being able to write back that the task completed.

There are a couple of potential actions we could take:

  1. Warn users via the status service and status pages that they're running Kibana in an unsupported configuration.
  2. Change Core's status to unavailable / critical, which will disable all HTTP routes by returning 503s.
  3. Block all ES requests until versions match again, or throw an ESVersionMismatchError for all ES requests.
  4. Throw a fatal error and exit the Kibana process

(3) has the potential for causing data loss or unexpected behaviour because e.g. task manager cannot write that it has completed a task that was already in progress by the time Core realised the ES node versions aren't compatible.

If we only do (1) there's a lot of unknown behaviour that we cannot predict or account for. (2) doesn't prevent all unknown behaviours because there are still background processes which continue running, but it provides a degree of protection and Kibana can quickly become available again once all the ES nodes are compatible again.

(4) is the most aggressive: it prevents any further unknown behaviours from continuing, but it also means we're not able to automatically recover once ES nodes are compatible again. Many (most?) Kibana processes running in production would be automatically restarted if they crash, in which case we'll bring up the "not ready" server and wait for ES nodes to match before proceeding to load all plugins. So if we assume the Kibana process gets restarted automatically, then Kibana will automatically recover once the ES nodes are compatible again.

I think because we have no way to anticipate or predict the behaviours that would result from this, we should probably assume the worst to protect our users and their data.
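
For the detection side of this (leaving open which of options 1-4 to take), here is a sketch that lists ES nodes whose version doesn't match Kibana's, using the nodes.info API. KIBANA_VERSION and the simplified compatibility rule are assumptions.

```ts
import { Client } from '@elastic/elasticsearch';

const KIBANA_VERSION = '7.6.0'; // placeholder for the running Kibana version

export async function findIncompatibleNodes(client: Client): Promise<string[]> {
  const response = await client.nodes.info();
  // v7 clients wrap the result in { body }, v8 clients return it directly
  const nodes = (response as any).body?.nodes ?? (response as any).nodes;
  const [kbMajor, kbMinor] = KIBANA_VERSION.split('.').map(Number);

  return Object.values<any>(nodes)
    .filter((node) => {
      const [esMajor, esMinor] = String(node.version).split('.').map(Number);
      return esMajor !== kbMajor || esMinor < kbMinor;
    })
    .map((node) => `${node.name} (${node.version})`);
}
```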

rudolf (Contributor, Author) commented Dec 3, 2020

When the ES versions aren't compatible, we want to prevent any new work from being accepted and give plugins a chance to finish any in-progress work. This is very similar to a graceful shutdown, so initiating a graceful shutdown might be the best solution (#84452).

This has the disadvantage that the Kibana process would have to be started up again in order to be available once all ES versions are compatible, but in most environments this would happen automatically.
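
A sketch of what that graceful shutdown path could look like: stopAcceptingNewWork, waitForInflightWork, and the 30-second grace period are hypothetical hooks standing in for whatever Kibana would actually do, and the non-zero exit relies on an external process manager (systemd, Docker, Kubernetes) restarting Kibana once the cluster is compatible again.

```ts
const SHUTDOWN_GRACE_PERIOD_MS = 30_000;

async function stopAcceptingNewWork(): Promise<void> {
  // e.g. close the HTTP server and pause background task polling (hypothetical)
}

async function waitForInflightWork(_timeoutMs: number): Promise<void> {
  // e.g. wait for in-progress task manager runs to write their results (hypothetical)
}

export async function shutdownOnIncompatibleEs(reason: string): Promise<void> {
  console.error(`Shutting down Kibana: ${reason}`);
  await stopAcceptingNewWork();
  await waitForInflightWork(SHUTDOWN_GRACE_PERIOD_MS);
  // a non-zero exit code signals the supervisor to restart the process
  process.exit(1);
}
```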

legrego (Member) commented Dec 3, 2020

I'm positive this would be more work, but instead of a graceful shutdown, could we do a graceful stop, and then have a small piece of core left online to wait for a compatible ES cluster before running through the setup/start phases?

pgayvallet (Contributor) commented:

Most of the ideas here have been implemented over the years, and the rest is superseded by #170294. Closing.

Projects
Status: Done (7.13)