
Kibana causes off heap memory problems on elasticsearch masters. #16733

Closed
kszarlej opened this issue Feb 14, 2018 · 7 comments
Labels: Team:Operations

Comments

@kszarlej

kszarlej commented Feb 14, 2018

Kibana version:
5.6.3 (docker pull docker.elastic.co/kibana/kibana:5.6.3)

Elasticsearch version:
5.6.3 (docker.elastic.co/elasticsearch/elasticsearch:5.6.3)

Server OS version:
Ubuntu 14.04 LTS

Browser version:
Not relevant to issue

Browser OS version:
Not relevant to issue

Original install method (e.g. download page, yum, from source, etc.):
Docker images from docker.elastic.co registry.

Description of the problem including expected versus actual behavior:
The issue was also carefully described in this topic: https://discuss.elastic.co/t/elasticsearch-5-6-very-quickly-increasing-direct-buffer-pools/119561

I noticed that the direct memory buffer size on my master nodes is rising very quickly. Before I limited MaxDirectMemorySize, this was causing memory exhaustion on the master nodes (even when the heap was only 30% full). Off-heap memory (direct memory) was driving the RSS of the Java process above 99% of the available RAM, and the master nodes were dying either by swapping or, after I disabled swap, because of OOMs.

For now I have fixed the memory exhaustion by setting the -XX:MaxDirectMemorySize option to 20% of RAM.
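
For reference, the cap can be passed to the official image roughly like this, via the ES_JAVA_OPTS environment variable (a sketch only - the heap sizes are placeholders, the 700 MB cap matches the limit I describe below):

# Sketch: start the official 5.6.3 image with a direct memory cap via ES_JAVA_OPTS
# (heap sizes are placeholders; adjust them to your boxes)
docker run -d \
  -e ES_JAVA_OPTS="-Xms4g -Xmx4g -XX:MaxDirectMemorySize=700m" \
  docker.elastic.co/elasticsearch/elasticsearch:5.6.3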

However, because direct memory grows so quickly, the JVM now saves itself by requesting explicit GCs very frequently. With the current 700 MB -XX:MaxDirectMemorySize limit, the direct memory fills up in about 3-4 minutes; when the limit is hit, an explicit GC is run to clean up these buffers.

On a chart this looks as follows:
[screenshot: explicit GC activity chart]

At the same time, the DirectMemoryBuffers chart:
[screenshot: direct memory buffers chart]

Each red annotation line is an explicit GC caused by the -XX:MaxDirectMemorySize limit.

And now the Kibana part :) Whenever I turn off Kibana, the direct memory growth on the master node freezes. I excluded the possibility of some strange query because:

  1. Most user queries happen during the day, while the excessive buffer growth happens all the time,
  2. I enabled logging of query and fetch operations (a sketch of the slowlog settings is below this list) - over a period of 20 minutes there were only 300 queries, and the longest one took 279 milliseconds. During that time the buffers were of course growing like hell, 250-300 MB per second,
  3. When I DROPed all incoming traffic to Kibana (users and TVs couldn't connect), the direct memory buffers were still growing.
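
The query/fetch logging from point 2 can be enabled roughly like this (a sketch only - the host and the catch-all index pattern are placeholders; dropping the warn thresholds to 0 makes every query and fetch phase show up in the slowlog):

# Sketch: log every query/fetch phase by setting the search slowlog warn thresholds to 0
curl -XPUT 'http://essearch1.example:9200/*/_settings' -d '
{
  "index.search.slowlog.threshold.query.warn": "0s",
  "index.search.slowlog.threshold.fetch.warn": "0s"
}'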

Looking at the traffic between the nodes in the cluster, I made the following breakdown:

  • Network traffic on essearch1 and on the master node is almost identical in numbers, which suggests that essearch1 causes the traffic on the master node,
  • essearch1 is the node Kibana uses as elasticsearch_url, so the traffic from essearch1 to the master node is most likely caused by Kibana.

Then I started dumping traffic. essearch1 is sending thousands of "empty" packets to the master; they look like just a TCP header with an empty payload. They show up in the dump at a fixed interval, all at once, which looks like a scheduled job such as a health check.

10:35:11.482908 IP essearch1.example.42055 > ip-172-20-33-224.eu-west-1.compute.internal.9300: Flags [.], ack 4273644020, win 12305, options [nop,nop,TS val 645666083 ecr 12010824], length 0
        0x0000:  0004 0001 0006 0aac 346f 5278 0000 0800  ........4oRx....
        0x0010:  4500 0034 e72f 4000 4006 b7d1 ac11 21bd  E..4./@.@.....!.
        0x0020:  ac14 21e0 a447 2454 cb0f 54e6 feba a1f4  ..!..G$T..T.....
        0x0030:  8010 3011 9be9 0000 0101 080a 267c 1523  ..0.........&|.#
        0x0040:  00b7 4548                                ..EH

I wanted to know the source of these, so I sniffed the inbound traffic from Kibana to the essearch1 box and noticed lots of packets from Kibana that appear at the same interval as the packets sent from essearch1 to master1. We use an HTTP LB in front of the essearch nodes, so I checked its access logs. Here are all the requests that Kibana makes regularly to the essearch nodes; they look like health checks plus some actions unknown to me.

essearch.example 172.17.32.9 - - [14/Feb/2018:13:00:38 +0000]  "HEAD / HTTP/1.1" 200 0 165 "-" "-" "253619" "0.002" "0.002" "172.17.33.189:9200" "200" "-" "-"
essearch.example 172.17.32.9 - - [14/Feb/2018:13:00:38 +0000]  "GET /_nodes?filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip HTTP/1.1" 200 921 242 "-" "-" "253619" "0.019" "0.019" "172.17.33.189:9200" "200" "-" "-"
essearch.example 172.17.32.9 - - [14/Feb/2018:13:00:38 +0000]  "GET /_nodes/_local?filter_path=nodes.*.settings.tribe HTTP/1.1" 200 2 212 "-" "-" "253619" "0.001" "0.001" "172.17.33.189:9200" "200" "-" "-"
essearch.example 172.17.32.9 - - [14/Feb/2018:13:00:38 +0000]  "POST /_mget HTTP/1.1" 200 145 284 "-" "-" "253619" "0.015" "0.015" "172.17.33.189:9200" "200" "-" "-"
essearch.example 172.17.32.9 - - [14/Feb/2018:13:00:38 +0000]  "GET /_cluster/health/.kibana?timeout=5s HTTP/1.1" 200 391 198 "-" "-" "253619" "0.017" "0.017" "172.17.33.189:9200" "200" "-" "-"
essearch.example 172.17.32.9 - - [14/Feb/2018:13:00:38 +0000]  "GET /.kibana/_mappings HTTP/1.1" 200 2394 181 "-" "-" "253619" "0.016" "0.016" "172.17.33.189:9200" "200" "-" "-"
essearch.example 172.17.32.9 - - [14/Feb/2018:13:00:38 +0000]  "POST /.kibana/_search?size=1000&from=0 HTTP/1.1" 200 738 527 "-" "-" "253619" "0.016" "0.016" "172.17.33.189:9200" "200" "-" "-" 

Now, on the LB, I started dropping these requests one by one while observing the direct memory buffers, in order to find which query causes the problem. I started with the /_nodes prefix because those queries contain wildcards and looked suspicious:

location /_nodes {
     return 403;
}



And at this moment the buffer size froze, just like on the chart below:
[screenshot: direct memory buffers flat after blocking /_nodes]

So I narrowed the problematic queries down to:

GET /_nodes?filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip
and
GET /_nodes/_local?filter_path=nodes.*.settings.tribe

What are they? How can we disable them?

I don't think my cluster is misconfigured (of course that might always be the case :P). We use the default configuration that comes with the official Docker images, and we have only modified Xms and Xmx to fit the size of our boxes, plus the MaxDirectMemorySize mentioned at the beginning of this description.

I also tested Kibana 5.6.7 - same problem. On 5.3.0 the problem does not seem to exist. Unfortunately I can't upgrade to 6.x right now to check whether the 6.x line is affected.

Summing up:

Expected behavior:
Kibana health check queries:

GET /_nodes?filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip
and
GET /_nodes/_local?filter_path=nodes.*.settings.tribe

shouldn't cause dramatic growth of direct memory on master nodes.

Actual behavior:
Seems like they do.

Steps to reproduce:

  1. Get a reasonably sized cluster on 5.6.3
  2. Start spamming a master node (e.g., with the loop sketched below this list) with
    GET /_nodes?filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip
    and
    GET /_nodes/_local?filter_path=nodes.*.settings.tribe
  3. Watch the direct buffer size in the JVM grow
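
A loop along these lines should do it (a sketch - the master hostname is a placeholder):

# Sketch: hammer a master node with the two health-check requests and watch direct buffers grow
while true; do
  curl -s 'http://master1.example:9200/_nodes?filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip' > /dev/null
  curl -s 'http://master1.example:9200/_nodes/_local?filter_path=nodes.*.settings.tribe' > /dev/null
done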

Cheers!

@Bargs added the Team:Operations and triage_needed labels Feb 14, 2018
@kszarlej
Author

kszarlej commented Feb 14, 2018

Looks like someone had a similar problem: elastic/elasticsearch#26269 (comment)

@jbudz
Member

jbudz commented Feb 20, 2018

Thanks for the detailed writeup - no doubt this is way too expensive.

What are they? How can we disable it?

They're health checks Kibana sends to make sure the cluster it's connecting to is supported. We're planning on removing them; progress is tracked at #14163.

In the meantime you can set elasticsearch.healthCheck.delay to a large number to decrease the frequency. It expects milliseconds.

I'm going to close this out so we can keep the discussion in #14163; I'll drop a link to the numbers in that issue.

@jbudz closed this as completed Feb 20, 2018
@jbudz mentioned this issue Feb 20, 2018
@jbudz
Member

jbudz commented Feb 20, 2018

Here's a shorter-term option too: we could bump the default (#13909) and see if any requests can be moved to startup checks.

@kszarlej
Author

kszarlej commented Feb 21, 2018

Thanks @jbudz.
Is it fine to set it to, for example, 60 seconds? What could the consequences be?

@jbudz
Member

jbudz commented Feb 21, 2018

60 seconds is fine, and longer is fine too. The health check will notify you if the cluster goes down or if any Elasticsearch nodes are on the wrong version, will create the .kibana index if it doesn't exist, and will migrate configuration on an upgrade.

The first two you can keep an eye on manually. The last two are important but don't need a high frequency (running once on startup is enough).

@karan1276

Sorry for a silly question, but how do I set the elasticsearch.healthCheck.delay property? I don't see an API for it - do I set it in the elasticsearch.yml file?

@sandeepkanabar

@karan1276 - my guess is you need to set it in the kibana.yml file. For example, to set the delay to 60 seconds, you could add the following to your kibana.yml:
elasticsearch.healthCheck.delay: 60000
