Remove the Health Check #14163

spalger · 2017-09-26T00:57:57Z

The platform & operations teams have been bouncing around the idea that Kibana really shouldn't be relying on a polling health check for a long time (years?) and we've had some more serious discussions about it recently that I wanted to catalog.

why did we create a health check?

to ensure that elasticsearch is available and in an expected state
to ensure that the Kibana index exists
to allow us to ignore some of the more difficult error handling and rely on global "status" to indicate that Kibana is unusable (status is "red")

what does the health check do?

the core logic is defined here, but to summarize:

sets the status of the elasticsearch plugin based on a list of checks that verify certain baseline conditions are met.
when these conditions are not met it sets the plugin's status to "red", which blocks rendering of all applications and renders the status page instead
when these conditions are met, sets the plugin's status to "green", which unblocks rendering of applications and is an indicator to other plugins that elasticsearch is usable
a summary of the conditions that must be met for the elasticsearch plugin to be "green":
- all nodes in the elasticsearch cluster must be compatible with Kibana version
- cluster must not be in "tribe" mode
- cluster must not have rest.action.multi.allow_explicit_index=false set
- Kibana index exists (created if missing)
- Kibana index has all of the necessary mappings (patched if some are missing)
- Kibana config document exists (created of upgraded from the most recent non-alpha/beta/snapshot version if it doesn't exist)
- if configured to use tribe node: all nodes in the data cluster must be compatible with Kibana version

what's wrong with the health check?

operates on an interval; just because the health check passed 5 seconds ago does not mean we "know" that elasticsearch is in the state we expect
see Refactor health check #13909, lots of requests made to elasticsearch every 2.5 second by default
users can/do increase the health check interval and increase the amount of time elasticsearch is in an unknown state
lowers our guard
encourages code that does not properly handle/react to error conditions

how might we remove it?

This part is totally up for debate and much of this is pretty freshly discussed. All of these steps are also likely to surface bugs in the way we handle errors across the board. Based on our conversations today I think the plan could go something like this:

[errors/multi.allow_explicit_index] move error handling to browser #14184 handle rest.action.multi.allow_explicit_index=false in the browser
- remove rest.action.multi.allow_explicit_index=false from health check
- when courier receives errors, before it splits the response out to listeners, check if the error is caused by this setting and render a full-page error message that explains the situation to users.
[uiSettings] auto create/upgrade saved config #14164 auto upgrade config
- use update api for uiSettings writes
- create or upgrade the config doc if update returns 404
- remove config from health check
Kibana index template #14112 kibana index template
- when elasticsearch plugin first turns green:
  - check for or add kibana index template
  - only if kibana index already exists, patch the mapping to ensure it has mappings for all saved object types
track es availability (in core?)
- true baseline availability, only checks that we can talk to es
- expose observable es availability status
  - undefined: availability is unknown or in the process of being determined
  - true/false: es seems to be available/unavailable
- expose esAvailability.requestCheck() to signal that availability should be checked
- in elasticsearch plugin:
  - request a check in plugin.init()
  - request a check on an interval (based on elasticsearch.healthCheck.delay)
  - when availability transitions to true:
    - set status to red when versions are incompatible
    - set status to red when admin cluster is tribe
use es availability with clients:
- in a client request interceptor:
  - wait for es availability to be true/false
  - if availability is true process request as normal
  - if availability is false reject with Boom.serverUnavailable() without sending request to elasticsearch
- in a client response interceptor:
  - request an availability check when status code >= 500
  - request an availability check when no status code (ie. timeout, socket hangup, or Boom.serverUnavailable() from request interceptor)
- remove availability check on interval
- maybe? listen to socket events within the clients to know when sockets are closed for reasons that might indicate elasticsearch is gone

The text was updated successfully, but these errors were encountered:

A part of elastic#14163, this removes the portion of the healthCheck that tries to verify that rest.action.multi.allow_explicit_index is not set to false. Instead, a ui module was created that will check errors from elasticsearch for this specific scenario, and exposes a method that will display a nicer "fatal error" screen that informs the user about what they should do, and navigates away from the now broken app.

spalger · 2017-09-26T20:50:23Z

Updated the description to reflect change of plans RE rest.action.multi.allow_explicit_index, see #14184

…14184) * [errors/multi.allow_explicit_index] move error handling to browser A part of #14163, this removes the portion of the healthCheck that tries to verify that rest.action.multi.allow_explicit_index is not set to false. Instead, a ui module was created that will check errors from elasticsearch for this specific scenario, and exposes a method that will display a nicer "fatal error" screen that informs the user about what they should do, and navigates away from the now broken app. * [es/healthCheck] remove old test * fix typo

…14184) * [errors/multi.allow_explicit_index] move error handling to browser A part of #14163, this removes the portion of the healthCheck that tries to verify that rest.action.multi.allow_explicit_index is not set to false. Instead, a ui module was created that will check errors from elasticsearch for this specific scenario, and exposes a method that will display a nicer "fatal error" screen that informs the user about what they should do, and navigates away from the now broken app. * [es/healthCheck] remove old test * fix typo (cherry picked from commit 2b7808b)

…lastic#14184) * [errors/multi.allow_explicit_index] move error handling to browser A part of elastic#14163, this removes the portion of the healthCheck that tries to verify that rest.action.multi.allow_explicit_index is not set to false. Instead, a ui module was created that will check errors from elasticsearch for this specific scenario, and exposes a method that will display a nicer "fatal error" screen that informs the user about what they should do, and navigates away from the now broken app. * [es/healthCheck] remove old test * fix typo

jbudz · 2018-02-20T17:09:15Z

#16733 (comment) has a nice writeup on memory issues this is causing with ES.

spalger added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Meta labels Sep 26, 2017

epixa added the high hanging fruit label Sep 26, 2017

spalger mentioned this issue Sep 26, 2017

[errors/multi.allow_explicit_index] move error handling to browser #14184

Merged

jbudz mentioned this issue Feb 20, 2018

Kibana causes off heap memory problems on elasticsearch masters. #16733

Closed

tylersmalley assigned tylersmalley and unassigned tylersmalley Jul 26, 2018

jbudz mentioned this issue Feb 7, 2019

Refactor health check #13909

Closed

rudolf mentioned this issue Aug 28, 2019

[New-Platform][Discuss] Elasticsearch connection availability #43456

Closed

rudolf mentioned this issue Oct 30, 2019

Add Elasticsearch version check to New Platform #49785

Closed

joshdover added the stale Used to mark issues that were closed for being stale label Jan 14, 2021

joshdover closed this as completed Jan 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove the Health Check #14163

Remove the Health Check #14163

spalger commented Sep 26, 2017 •

edited

Loading

spalger commented Sep 26, 2017

jbudz commented Feb 20, 2018

Remove the Health Check #14163

Remove the Health Check #14163

Comments

spalger commented Sep 26, 2017 • edited Loading

why did we create a health check?

what does the health check do?

what's wrong with the health check?

how might we remove it?

spalger commented Sep 26, 2017

jbudz commented Feb 20, 2018

spalger commented Sep 26, 2017 •

edited

Loading