Kibana causes off heap memory problems on elasticsearch masters. #16733
Looks like someone had a similar problem: elastic/elasticsearch#26269 (comment)
Thanks for the detailed writeup, no doubt this is way too expensive.
They're health checks Kibana sends to make sure the cluster it's connecting to is supported. We're planning on removing them and tracking progress at #14163. In the meantime you can raise the health check interval. I'm going to close this out so we can keep the discussion on #14163; I'll drop a link to the numbers in that issue.
Here's a shorter-term issue too: we could bump the default (#13909) and see if any requests can be moved to startup checks.
thanks @jbudz
60 seconds is fine; longer is fine too. The health check will notify you if the cluster goes down or if any Elasticsearch nodes are on the wrong version, will create the .kibana index if it doesn't exist, and will migrate configuration on an upgrade. The first two you can be aware of manually. The last two are important but don't need a high frequency (running once on startup is enough).
Sorry for a silly question, but how do I set the property?
@karan1276 - my guess is you need to set it in kibana.yml
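For anyone else wondering, a minimal sketch of what that could look like, assuming the setting in question is elasticsearch.healthCheck.delay (the health check interval, in milliseconds) and the config path of the official Kibana Docker image; adjust both for your deployment:
# Assumption: the property is elasticsearch.healthCheck.delay and kibana.yml lives at the
# official image's default location; restart Kibana afterwards for the change to take effect.
echo 'elasticsearch.healthCheck.delay: 60000' >> /usr/share/kibana/config/kibana.yml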
Kibana version:
5.6.3 (docker pull docker.elastic.co/kibana/kibana:5.6.3)
Elasticsearch version:
5.6.3 (docker.elastic.co/elasticsearch/elasticsearch:5.6.3)
Server OS version:
Ubuntu 14.04 LTS
Browser version:
Not relevant to issue
Browser OS version:
Not relevant to issue
Original install method (e.g. download page, yum, from source, etc.):
Docker images from docker.elastic.co registry.
Description of the problem including expected versus actual behavior:
The issue was also carefully described in this topic: https://discuss.elastic.co/t/elasticsearch-5-6-very-quickly-increasing-direct-buffer-pools/119561
I noticed that the direct memory buffer size on my master nodes is rising very quickly. Before I limited MaxDirectMemorySize it was causing memory exhaustion on the master nodes (even when the heap was only 30% full). Off-heap memory (direct memory) was driving the RSS of the Java process above 99% of available RAM, and the master nodes were dying either by swapping or, after disabling swap, because of OOMs.
For now I have worked around the memory exhaustion by setting the -XX:MaxDirectMemorySize option to 20% of RAM.
However, because direct memory grows so quickly, the JVM now saves itself by requesting explicit GCs very often. Currently I have a 700 MB limit for -XX:MaxDirectMemorySize; this memory fills up in about 3-4 minutes. When the limit is hit, an explicit GC is run to clean these buffers.
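For reference, a sketch of one way to apply that limit to the official Elasticsearch Docker image; the flag value comes from the setup above, the rest of the run command is illustrative only:
# The official image picks up extra JVM flags from the ES_JAVA_OPTS environment variable.
docker run -d --name es-master1 \
  -e ES_JAVA_OPTS="-XX:MaxDirectMemorySize=700m" \
  docker.elastic.co/elasticsearch/elasticsearch:5.6.3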
On a chart this looks as follows:
At the same time, the direct memory buffers chart:
Each red annotation line is an explicit GC caused by hitting the -XX:MaxDirectMemorySize limit.
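As an aside, the same direct buffer pool numbers can be read straight from the nodes stats API; a small sketch with a placeholder host name:
# Poll the JVM direct buffer pool stats on a master node (hostname is a placeholder)
curl -s 'http://master1:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.buffer_pools.direct'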
And now the Kibana part :) Whenever I turn off Kibana, the direct memory growth on the master node freezes. I excluded the possibility of some strange query because:
Looking at the traffic between nodes in the cluster I made the following breakdown:
Then I started dumping traffic. essearch1 is sending thousands of "empty" packets to the master. They look like just a TCP header with an empty payload. There are thousands of them, and they show up in the dump at some interval, all at once. It seems like a job with a defined interval, like a health check.
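For completeness, a sketch of the kind of capture used here (interface and host names are placeholders; 9300 is the default transport port):
# Capture transport-layer traffic between essearch1 and master1 (names are placeholders)
tcpdump -i any -nn host essearch1 and host master1 and tcp port 9300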
I wanted to know the source of these, so I sniffed the inbound traffic from Kibana to the essearch1 box and noticed lots of packets from Kibana that appear at the same interval as the packets sent from essearch1 to master1. We are using an HTTP LB in front of the essearch nodes, so I checked its access logs. Here are all the queries that Kibana makes regularly to the essearch nodes. They look like health checks plus some actions unknown to me.
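As a rough sketch of how such recurring paths can be pulled out of an access log (the log location and the column holding the request path depend on the LB and its log format, so treat both as assumptions):
# Count the most frequent request paths in the LB access log
# (log path and the awk column are assumptions; adjust for your log format)
awk '{print $7}' /var/log/lb/access.log | sort | uniq -c | sort -rn | head -20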
Now, on the LB, I started dropping these queries while observing the direct memory buffers, in order to find which query causes the problem. I started with the /_nodes prefix because those queries contain wildcards and looked suspicious.
And at this moment the buffer size froze, just like on the chart below:
So I narrowed the problematic queries down to:
GET /_nodes?filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip
and
GET /_nodes/_local?filter_path=nodes.*.settings.tribe
What are they? How can we disable them?
I don't think my cluster is misconfigured (of course that always might be the case :P); we are using the default configuration that comes with the official Docker images, and we have only modified Xms and Xmx to fit the size of our boxes, plus the MaxDirectMemorySize mentioned at the beginning of this description.
I also tested this on Kibana 5.6.7 - the same problem. On 5.3.0 the problem seems not to exist. Unfortunately I can't upgrade to 6.x right now to check whether it also happens on the 6.x line.
Summing up:
Expected behavior:
Kibana health check queries:
GET /_nodes?filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip
and
GET /_nodes/_local?filter_path=nodes.*.settings.tribe
shouldn't cause dramatic growth of direct memory on the master nodes.
Actual behavior:
Seems like they do.
Steps to reproduce:
GET /_nodes?filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip
and
GET /_nodes/_local?filter_path=nodes.*.settings.tribe
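A minimal reproduction sketch, replaying those two requests on a short interval roughly like the Kibana health check does and then reading direct buffer usage on a master; host names and the interval are assumptions:
# Replay the two health-check queries in a loop against a node behind the LB (placeholder host)
while true; do
  curl -s 'http://essearch1:9200/_nodes?filter_path=nodes.*.version,nodes.*.http.publish_address,nodes.*.ip' > /dev/null
  curl -s 'http://essearch1:9200/_nodes/_local?filter_path=nodes.*.settings.tribe' > /dev/null
  sleep 2.5
done
# In another shell, watch the direct buffer pool on the master node grow:
curl -s 'http://master1:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.buffer_pools.direct'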
Cheers!