
[0.12.1-1] influxd Memory leak #6449

Closed
ngortheone opened this issue Apr 22, 2016 · 24 comments

System info:

This is a CentOS 6.7 box with all the latest updates installed.

➜ ~ cat /etc/redhat-release
CentOS release 6.7 (Final)

➜ ~ uname -a
Linux myhost.com 2.6.32-573.12.1.el6.x86_64 #1 SMP Tue Dec 15 21:19:08 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

➜ ~ rpm -qa influxdb
influxdb-0.12.1-1.x86_64

➜ ~ cat /proc/meminfo
MemTotal: 15293128 kB

Steps to reproduce:

  1. Six Telegraf instances are writing data to this InfluxDB instance.
  2. The issue happens every night with no effort on my side.

Expected behavior: no memory leak
Actual behavior: memory leak

Additional info:
dmesg log attached
influxdb log attached (non-relevant messages, such as regular requests, removed)
graphs demonstrating the memory leak; it happens at around 3 am

dmesg.txt
influx_db_logs.txt
screen shot 2016-04-22 at 10 42 14 am
screen shot 2016-04-22 at 10 45 51 am
screen shot 2016-04-22 at 11 02 53 am

The problem is 100% reproducible and happens every night. I start my day by restarting the influxdb service :)
Let me know if I can supply more relevant info.

e-dard (Contributor) commented Apr 22, 2016

@ngortheone thanks for the report. A memory profile of the running influx server might be useful. You can generate one by running:

influxd -memprofile /path/to/save/profile (you can add in any other arguments you usually use).

I can't remember whether profiles get screwed up if the process crashes while they're being taken, so maybe run it for as much of a day as possible and then shut down cleanly to capture it.
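
Once you have the profile file, you can inspect where the memory went with Go's pprof tool. A minimal sketch, assuming the influxd binary is at /usr/bin/influxd and the profile was written to /tmp/influxd.mem (adjust both paths to your setup):

  # interactive mode: the 'top' command lists the functions with the largest allocations
  go tool pprof /usr/bin/influxd /tmp/influxd.mem

  # or non-interactively, just print the top allocators
  go tool pprof -top /usr/bin/influxd /tmp/influxd.mem

That should make it obvious whether the bulk of the allocations sit in one package.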

jwilder added this to the 0.13.0 milestone Apr 22, 2016
ngortheone (Author) commented Apr 23, 2016

@e-dard
I stopped the service when it consumed 99% of RAM.
See the attached memory profile.

profile.zip

e-dard (Contributor) commented Apr 25, 2016

@ngortheone thanks for the profile. A couple more questions:

  • What sort of volume is this instance receiving? How many requests per second roughly?
  • What are the typical sizes of the payloads being written to influx?

ngortheone (Author) commented Apr 25, 2016

@e-dard

Six Telegraf instances are sending data to this InfluxDB instance.
All six have default out-of-the-box configurations; nothing was tuned apart from the inputs section.

Here is the full copy of the inputs section used on all six Telegrafs.

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  fielddrop = ["time_*"]

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]

[[inputs.diskio]]

[[inputs.kernel]]

[[inputs.mem]]

[[inputs.processes]]

[[inputs.swap]]

[[inputs.system]]

[[inputs.net]]

I hope this gives you some clues about the volume of data being received.
If this info is not sufficient and you need more accurate measurements of the incoming data, please let me know.

e-dard self-assigned this Apr 25, 2016
e-dard (Contributor) commented Apr 25, 2016

@ngortheone Looking at the memory profile, around 45GB was allocated within Go's gzip package during the lifetime of this profile. If that covers roughly 24 hours, I don't think six Telegraf instances with the above config would generate enough data to account for it.

Are you sure nothing else is writing to the influx server? Can you give a rough idea of how much data is being written over a 24h period?

ngortheone (Author) commented Apr 25, 2016

@e-dard
I will be glad to provide more detailed info.

Are you sure nothing else is writing to the influx server?

At least nothing I know of.

Can you give a rough idea of how much data is being written over a 24h period?

I will collect 'df -h' stats 24 hours apart.

How else can I get this info? Maybe the full influxdb log or a database dump?

e-dard (Contributor) commented Apr 25, 2016

full influxdb logs would be useful, yes!
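
If it helps, something like this gives a rough per-hour count of write requests straight from the log. A sketch, assuming the default rpm log location of /var/log/influxdb/influxd.log and the standard [http] access-log lines (adjust the path if yours differs):

  # count POST /write requests per hour
  grep 'POST /write' /var/log/influxdb/influxd.log | awk '{print $2, substr($3,1,2)":00"}' | sort | uniq -c

That, together with the typical request payload size, gives a ballpark figure for daily write volume.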

jwilder (Contributor) commented Apr 25, 2016

Maybe #6425 is related?

ngortheone (Author) commented Apr 25, 2016

@e-dard
Here is a bunch of logs for the past week:
logs.zip

And here is the disk space usage graph for the same period:
screen shot 2016-04-25 at 7 42 37 pm

e-dard (Contributor) commented Apr 25, 2016

@ngortheone thanks! I assume myproject is the DB Telegraf is writing to?

Have you checked your Telegraf logs? It looks like there could be issues there with writes failing. The reason I think that might be the case is that Telegraf appears to be trying to recreate a DB on Influx.

Of course, there is still a question of how Influx handles those failed writes, but it could come down to this issue, which has been fixed.

Telegraf logs would be useful.

ngortheone (Author) commented Apr 25, 2016

@e-dard

I assume myproject is the DB Telegraf is writing to?

Correct. I will collect telegraf logs and let you know.

ngortheone (Author) commented Apr 25, 2016

@e-dard
Here are logs from 2 Telegrafs, 7 days each.
I can see some write errors there; please see the attachment.

telegraf-logs.zip

Also
➜ ~ rpm -qa telegraf
telegraf-0.12.0-1.x86_64
➜ ~ sudo yum update telegraf
...
---> Package telegraf.x86_64 0:0.12.0-1 will be updated
---> Package telegraf.x86_64 0:0.12.1-1 will be an update

Is this a fix for #6425?

@avinocur

Hi, we're having a very similar issue, although we're not using Telegraf but posting custom metrics directly using the Java client.
We're currently running on two Ubuntu servers with 8 cores and 16GB RAM, using influxdb-relay as a write proxy to both, on InfluxDB version 0.12.1.
We're recording metrics to 4 different measurements and running hourly continuous queries on 3 of them. We have 7 measurements and 976641 series in our database at this moment, and are handling roughly 20 writes per second.

We're experiencing a continuous increase in memory consumption until the database finally crashes with an out-of-memory error. This happened before on boxes with 8GB of RAM. We migrated to 16GB thinking that it might stabilize at some point, but that didn't happen.

This is the memory use of the last couple of hours since we restarted the server:

memory-influx

We implemented a temporary workaround with a cron job that restarts the server when it crashes (it checks every minute whether Influx is running), but that produces some inconsistencies between the two servers because of metrics posted during that minute of downtime. Is there any way to configure influxdb-relay to buffer and replay the writes performed during this short downtime? I saw that this is available for UDP, but couldn't manage to configure it for the HTTP interface.

Please let me know if you think this is unrelated and should be on a separate issue.

Regards,
Adrian

dvenza commented Apr 26, 2016

I have the "out of memory" crash too. I have about 20 instances of Telegraf sending data, plus some more metrics sent via HTTP. The crash does not happen every day; sometimes it does, and sometimes two or three days pass with no problems.
InfluxDB 0.12.1 and 16GB of RAM.
I have no errors in the Telegraf logs (sometimes it fails to gather Docker stats, but I think that is unrelated).

e-dard (Contributor) commented Apr 26, 2016

@ngortheone

Here are logs from 2 Telegrafs, 7 days each.
I can see some write errors there; please see the attachment.

Yep, all these write errors are likely the issue here. Due to #6425 and influxdata/telegraf#1061, connections are left hanging around, and buffers are allocated for each of these connections. These allocations are what I'm seeing in the memory profile, and they are why you're OOMing.

The quickest way to resolve this is to fix the problem that's causing Telegraf to fail to write to Influx. It looks like some sort of timeout; are you sure your Telegraf instances can reach the Influx box? Fix that, and this whole issue should go away.

As for the bug in Influx, it was fixed in #6425 and will be in the 0.13 release. Telegraf's next release will also include influxdata/telegraf#1061.

@ngortheone once you get your Telegraf instances working properly let me know if it works out OK, and we can close this issue. 👍
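
If it's useful in the meantime, a rough way to check for those lingering connections on the Influx box is to count TCP connections on the HTTP port, grouped by state. A sketch, assuming InfluxDB is listening on the default port 8086:

  # a steadily growing count here would match the leaked-connection theory
  ss -tan | grep ':8086' | awk '{print $1}' | sort | uniq -c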

e-dard (Contributor) commented Apr 26, 2016

@dvenza I think it could be the same issue. Do you see lines in your InfluxDB log along the lines of:

[query] 2016/04/19 03:50:30 CREATE DATABASE IF NOT EXISTS DB_USED_BY_TELEGRAF
[http] 2016/04/19 03:50:30 10.17.128.180 - - [19/Apr/2016:03:50:30 +0300] GET /query?db=&q=CREATE+DATABASE+IF+NOT+EXISTS+%22DB_USED_BY_TELEGRAF%22 HTTP/1.1 200 40 - InfluxDBClient b6ef9665-05c8-11e6-83d2-000000000000 730.248µs

??
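
A quick way to check from the shell (a sketch, assuming the default rpm log location of /var/log/influxdb/influxd.log):

  # count how often a client has tried to (re)create the database
  grep -c 'CREATE DATABASE IF NOT EXISTS' /var/log/influxdb/influxd.log

A large and growing count would point at clients retrying after failed writes.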

e-dard (Contributor) commented Apr 26, 2016

@avinocur Can you confirm if you have a significant number of write errors in your Java client? This issue looks to be caused by a TCP connection leak due to write errors from Telegraf, but other clients could probably cause the same issue.

dvenza commented Apr 26, 2016

Yes, I see those lines. I will try to understand what is failing on the Telegraf side. When will 0.13 be released?

e-dard (Contributor) commented Apr 26, 2016

@dvenza Likely the first week of May. Of course, fixing whatever is causing your Telegraf writes to fail will resolve this issue before then 😄

@avinocur

@e-dard Surprisingly, yes! We're performing 3 retries and were monitoring errors, and it seems to be retrying a lot due to 500 errors from the server. So it's most likely a related issue.
On that matter, how would you check on the InfluxDB server for the root cause of these errors? Or are there any common reasons for a server to reject writes? Load on the server is pretty low, even during the execution of the continuous queries, so I'm a little confused about where to look next...

Thanks in advance!

e-dard (Contributor) commented Apr 26, 2016

@avinocur there are no common reasons for rejecting writes. It sounds like it may be worth opening a new issue with the relevant logs/data/setup.
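
In the meantime, a rough way to surface the failing requests in the influxd HTTP access log is to filter for 5xx responses. A sketch, assuming the default rpm log location of /var/log/influxdb/influxd.log and the [http] log format shown earlier in this thread:

  # list requests that came back with a 5xx status code
  grep '\[http\]' /var/log/influxdb/influxd.log | grep -E 'HTTP/1\.[01]"? 5[0-9]{2} '

Nearby log lines may show why the server returned an error, which should help when opening a new issue.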

@ngortheone

@e-dard
I have updated the Telegrafs from 0.12.0-1 to 0.12.1-1. No signs of a memory leak yet.
I will watch the system for a couple of days; if there is no memory leak, we will close this issue.

@ngortheone

It seems that after updating the Telegrafs, memory no longer leaks. Thanks @e-dard and everyone for the help. Closing the issue.

e-dard (Contributor) commented Apr 28, 2016

@ngortheone great! Glad to hear it's sorted out 👍
