Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the Health Check #14163

Closed
3 of 5 tasks
spalger opened this issue Sep 26, 2017 · 2 comments
Closed
3 of 5 tasks

Remove the Health Check #14163

spalger opened this issue Sep 26, 2017 · 2 comments
Labels
high hanging fruit Meta stale Used to mark issues that were closed for being stale Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@spalger
Copy link
Contributor

spalger commented Sep 26, 2017

The platform & operations teams have been bouncing around the idea that Kibana really shouldn't be relying on a polling health check for a long time (years?) and we've had some more serious discussions about it recently that I wanted to catalog.

why did we create a health check?

  • to ensure that elasticsearch is available and in an expected state
  • to ensure that the Kibana index exists
  • to allow us to ignore some of the more difficult error handling and rely on global "status" to indicate that Kibana is unusable (status is "red")

what does the health check do?

the core logic is defined here, but to summarize:

  • sets the status of the elasticsearch plugin based on a list of checks that verify certain baseline conditions are met.
  • when these conditions are not met it sets the plugin's status to "red", which blocks rendering of all applications and renders the status page instead
  • when these conditions are met, sets the plugin's status to "green", which unblocks rendering of applications and is an indicator to other plugins that elasticsearch is usable
  • a summary of the conditions that must be met for the elasticsearch plugin to be "green":
    • all nodes in the elasticsearch cluster must be compatible with Kibana version
    • cluster must not be in "tribe" mode
    • cluster must not have rest.action.multi.allow_explicit_index=false set
    • Kibana index exists (created if missing)
    • Kibana index has all of the necessary mappings (patched if some are missing)
    • Kibana config document exists (created of upgraded from the most recent non-alpha/beta/snapshot version if it doesn't exist)
    • if configured to use tribe node: all nodes in the data cluster must be compatible with Kibana version

what's wrong with the health check?

  • operates on an interval; just because the health check passed 5 seconds ago does not mean we "know" that elasticsearch is in the state we expect
  • see Refactor health check #13909, lots of requests made to elasticsearch every 2.5 second by default
  • users can/do increase the health check interval and increase the amount of time elasticsearch is in an unknown state
  • lowers our guard
  • encourages code that does not properly handle/react to error conditions

how might we remove it?

This part is totally up for debate and much of this is pretty freshly discussed. All of these steps are also likely to surface bugs in the way we handle errors across the board. Based on our conversations today I think the plan could go something like this:

  • [errors/multi.allow_explicit_index] move error handling to browser #14184 handle rest.action.multi.allow_explicit_index=false in the browser
    • remove rest.action.multi.allow_explicit_index=false from health check
    • when courier receives errors, before it splits the response out to listeners, check if the error is caused by this setting and render a full-page error message that explains the situation to users.
  • [uiSettings] auto create/upgrade saved config #14164 auto upgrade config
    • use update api for uiSettings writes
    • create or upgrade the config doc if update returns 404
    • remove config from health check
  • Kibana index template #14112 kibana index template
    • when elasticsearch plugin first turns green:
      • check for or add kibana index template
      • only if kibana index already exists, patch the mapping to ensure it has mappings for all saved object types
  • track es availability (in core?)
    • true baseline availability, only checks that we can talk to es
    • expose observable es availability status
      • undefined: availability is unknown or in the process of being determined
      • true/false: es seems to be available/unavailable
    • expose esAvailability.requestCheck() to signal that availability should be checked
    • in elasticsearch plugin:
      • request a check in plugin.init()
      • request a check on an interval (based on elasticsearch.healthCheck.delay)
      • when availability transitions to true:
        • set status to red when versions are incompatible
        • set status to red when admin cluster is tribe
  • use es availability with clients:
    • in a client request interceptor:
      • wait for es availability to be true/false
      • if availability is true process request as normal
      • if availability is false reject with Boom.serverUnavailable() without sending request to elasticsearch
    • in a client response interceptor:
      • request an availability check when status code >= 500
      • request an availability check when no status code (ie. timeout, socket hangup, or Boom.serverUnavailable() from request interceptor)
    • remove availability check on interval
    • maybe? listen to socket events within the clients to know when sockets are closed for reasons that might indicate elasticsearch is gone
@spalger spalger added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Meta labels Sep 26, 2017
spalger added a commit to spalger/kibana that referenced this issue Sep 26, 2017
A part of elastic#14163, this removes the portion of the healthCheck that tries to verify that rest.action.multi.allow_explicit_index is not set to false. Instead, a ui module was created that will check errors from elasticsearch for this specific scenario, and exposes a method that will display a nicer "fatal error" screen that informs the user about what they should do, and navigates away from the now broken app.
@spalger
Copy link
Contributor Author

spalger commented Sep 26, 2017

Updated the description to reflect change of plans RE rest.action.multi.allow_explicit_index, see #14184

spalger added a commit that referenced this issue Oct 4, 2017
…14184)

* [errors/multi.allow_explicit_index] move error handling to browser

A part of #14163, this removes the portion of the healthCheck that tries to verify that rest.action.multi.allow_explicit_index is not set to false. Instead, a ui module was created that will check errors from elasticsearch for this specific scenario, and exposes a method that will display a nicer "fatal error" screen that informs the user about what they should do, and navigates away from the now broken app.

* [es/healthCheck] remove old test

* fix typo
spalger added a commit that referenced this issue Oct 4, 2017
…14184)

* [errors/multi.allow_explicit_index] move error handling to browser

A part of #14163, this removes the portion of the healthCheck that tries to verify that rest.action.multi.allow_explicit_index is not set to false. Instead, a ui module was created that will check errors from elasticsearch for this specific scenario, and exposes a method that will display a nicer "fatal error" screen that informs the user about what they should do, and navigates away from the now broken app.

* [es/healthCheck] remove old test

* fix typo

(cherry picked from commit 2b7808b)
chrisronline pushed a commit to chrisronline/kibana that referenced this issue Nov 20, 2017
…lastic#14184)

* [errors/multi.allow_explicit_index] move error handling to browser

A part of elastic#14163, this removes the portion of the healthCheck that tries to verify that rest.action.multi.allow_explicit_index is not set to false. Instead, a ui module was created that will check errors from elasticsearch for this specific scenario, and exposes a method that will display a nicer "fatal error" screen that informs the user about what they should do, and navigates away from the now broken app.

* [es/healthCheck] remove old test

* fix typo
chrisronline pushed a commit to chrisronline/kibana that referenced this issue Dec 1, 2017
…lastic#14184)

* [errors/multi.allow_explicit_index] move error handling to browser

A part of elastic#14163, this removes the portion of the healthCheck that tries to verify that rest.action.multi.allow_explicit_index is not set to false. Instead, a ui module was created that will check errors from elasticsearch for this specific scenario, and exposes a method that will display a nicer "fatal error" screen that informs the user about what they should do, and navigates away from the now broken app.

* [es/healthCheck] remove old test

* fix typo
@jbudz
Copy link
Member

jbudz commented Feb 20, 2018

#16733 (comment) has a nice writeup on memory issues this is causing with ES.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
high hanging fruit Meta stale Used to mark issues that were closed for being stale Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc
Projects
None yet
Development

No branches or pull requests

5 participants