
[0.12.1-1] influxd Memory leak #6449

Closed
ngortheone opened this issue Apr 22, 2016 · 24 comments

System info:

This is a CentOS 6.7 box with all the latest updates installed.

➜ ~ cat /etc/redhat-release
CentOS release 6.7 (Final)

➜ ~ uname -a
Linux myhost.com 2.6.32-573.12.1.el6.x86_64 #1 SMP Tue Dec 15 21:19:08 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

➜ ~ rpm -qa influxdb
influxdb-0.12.1-1.x86_64

➜ ~ cat /proc/meminfo
MemTotal: 15293128 kB

Steps to reproduce:

  1. Six Telegraf instances are writing data to this InfluxDB instance.
  2. The issue happens every night with no effort on my side.

Expected behavior: no memory leak
Actual behavior: memory leak

Additional info:
dmesg log attached
influxdb log attached (non-relevant messages, such as regular requests, removed)
graphs demonstrating the memory leak; it happens at around 3 am

dmesg.txt
influx_db_logs.txt
screen shot 2016-04-22 at 10 42 14 am
screen shot 2016-04-22 at 10 45 51 am
screen shot 2016-04-22 at 11 02 53 am

The problem is 100% reproducible and happens every night. I start my day by restarting the influxdb service :)
Let me know if I can supply more relevant info.

e-dard (Contributor) commented Apr 22, 2016

@ngortheone thanks for the report. A memory profile of the running influx server might be useful. You can generate one by running:

influxd -memprofile /path/to/save/profile (you can add in any other arguments you usually use).

I can't remember whether profiles get screwed up if the process crashes while they're being taken, so maybe run it for as much of a day as possible and then shut down cleanly to capture it.
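
Once you have the profile file, you can inspect where the memory went with Go's pprof tool. A minimal sketch, assuming the influxd binary is at /usr/bin/influxd and the profile was written to /tmp/influxd.mem (adjust both paths to your setup):

  # interactive mode: the 'top' command lists the functions with the largest allocations
  go tool pprof /usr/bin/influxd /tmp/influxd.mem

  # or non-interactively, just print the top allocators
  go tool pprof -top /usr/bin/influxd /tmp/influxd.mem

That should make it obvious whether the bulk of the allocations sit in one package.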

jwilder added this to the 0.13.0 milestone Apr 22, 2016
ngortheone (Author) commented Apr 23, 2016

@e-dard
I stopped the service when it consumed 99% of RAM.
See the attached memory profile.

profile.zip

e-dard (Contributor) commented Apr 25, 2016

@ngortheone thanks for the profile. A couple more questions:

  • What sort of volume is this instance receiving? How many requests per second roughly?
  • What are the typical sizes of the payloads being written to influx?

ngortheone (Author) commented Apr 25, 2016

@e-dard

Six Telegraf instances are sending data to this InfluxDB instance.
All six have default out-of-the-box configurations; nothing was tuned apart from the inputs section.

Here is the full copy of the inputs section used on all six Telegrafs.

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  fielddrop = ["time_*"]

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]

[[inputs.diskio]]

[[inputs.kernel]]

[[inputs.mem]]

[[inputs.processes]]

[[inputs.swap]]

[[inputs.system]]

[[inputs.net]]

I hope this gives you some clues about the volume of data being received.
If this info is not sufficient and you need more accurate measurements of the incoming data, please let me know.

e-dard self-assigned this Apr 25, 2016
e-dard (Contributor) commented Apr 25, 2016

@ngortheone Looking at the memory profile, around 45GB was allocated within Go's gzip package during the lifetime of this profile. If that covers roughly 24 hours, I don't think six Telegraf instances with the above config would generate enough data to account for it.

Are you sure nothing else is writing to the influx server? Can you give a rough idea of how much data is being written over a 24h period?

ngortheone (Author) commented Apr 25, 2016

@e-dard
I will be glad to provide more detailed info.

Are you sure nothing else is writing to the influx server?

At least nothing I know of.

Can you give a rough idea of how much data is being written over a 24h period?

I will collect 'df -h' stats 24 hours apart.

How else can I get this info? Maybe the full influxdb log or a database dump?

e-dard (Contributor) commented Apr 25, 2016

full influxdb logs would be useful, yes!
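
If it helps, something like this gives a rough per-hour count of write requests straight from the log. A sketch, assuming the default rpm log location of /var/log/influxdb/influxd.log and the standard [http] access-log lines (adjust the path if yours differs):

  # count POST /write requests per hour
  grep 'POST /write' /var/log/influxdb/influxd.log | awk '{print $2, substr($3,1,2)":00"}' | sort | uniq -c

That, together with the typical request payload size, gives a ballpark figure for daily write volume.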

jwilder (Contributor) commented Apr 25, 2016

Maybe #6425 is related?

ngortheone (Author) commented Apr 25, 2016

@e-dard
Here is a bunch of logs for the past week:
logs.zip

And here is the disk space usage graph for the same period:
screen shot 2016-04-25 at 7 42 37 pm

e-dard (Contributor) commented Apr 25, 2016

@ngortheone thanks! I assume myproject is the DB Telegraf is writing to?

Have you checked your Telegraf logs? It looks like there could be issues there with writes failing. The reason I think that might be the case is that Telegraf appears to be trying to recreate a DB on Influx.

Of course, there is still a question of how Influx handles those failed writes, but it could come down to this issue, which has been fixed.

Telegraf logs would be useful.

ngortheone (Author) commented Apr 25, 2016

@e-dard

I assume myproject is the DB Telegraf is writing to?

Correct. I will collect telegraf logs and let you know.

ngortheone (Author) commented Apr 25, 2016

@e-dard
Here are logs from 2 Telegrafs, 7 days each.
I can see some write errors there; please see the attachment.

telegraf-logs.zip

Also
➜ ~ rpm -qa telegraf
telegraf-0.12.0-1.x86_64
➜ ~ sudo yum update telegraf
...
---> Package telegraf.x86_64 0:0.12.0-1 will be updated
---> Package telegraf.x86_64 0:0.12.1-1 will be an update

Is this a fix for #6425?

@avinocur

Hi, we're having a very similar issue, although we're not using Telegraf but posting custom metrics directly using the Java client.
We're currently running on two Ubuntu servers with 8 cores and 16GB RAM, using influxdb-relay as a write proxy to both, on InfluxDB version 0.12.1.
We're recording metrics to 4 different measurements and running hourly continuous queries on 3 of them. We have 7 measurements and 976641 series in our database at this moment, and are handling roughly 20 writes per second.

We're experiencing a continuous increase in memory consumption until the database finally crashes with an out-of-memory error. This happened before on boxes with 8GB of RAM. We migrated to 16GB thinking that it might stabilize at some point, but that didn't happen.

This is the memory use of the last couple of hours since we restarted the server:

memory-influx

We implemented a temporary workaround with a cron job that restarts the server when it crashes (it checks every minute whether Influx is running), but that produces some inconsistencies between the two servers because of metrics posted during that minute of downtime. Is there any way to configure influxdb-relay to buffer and replay the writes performed during this short downtime? I saw that this is available for UDP, but couldn't manage to configure it for the HTTP interface.

Please let me know if you think this is unrelated and should be on a separate issue.

Regards,
Adrian

dvenza commented Apr 26, 2016

I have the "out of memory" crash too. I have about 20 instances of Telegraf sending data, plus some more metrics sent via HTTP. The crash does not happen every day; sometimes it does, and sometimes two or three days pass with no problems.
InfluxDB 0.12.1 and 16GB of RAM.
I have no errors in the Telegraf logs (sometimes it fails to gather Docker stats, but I think that is unrelated).

e-dard (Contributor) commented Apr 26, 2016

@ngortheone

Here are logs from 2 Telegrafs, 7 days each.
I can see some write errors there; please see the attachment.

Yep, all these write errors are likely the issue here. Due to #6425 and influxdata/telegraf#1061, connections are left hanging around, and buffers are allocated for each of these connections. These allocations are what I'm seeing in the memory profile, and they are why you're OOMing.

The quickest way to resolve this is to fix the problem that's causing Telegraf to fail to write to Influx. It looks like some sort of timeout; are you sure your Telegraf instances can reach the Influx box? Fix that, and this whole issue should go away.

As for the bug in Influx, it was fixed in #6425 and will be in the 0.13 release. Telegraf's next release will also include influxdata/telegraf#1061.

@ngortheone once you get your Telegraf instances working properly let me know if it works out OK, and we can close this issue. 👍
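
If it's useful in the meantime, a rough way to check for those lingering connections on the Influx box is to count TCP connections on the HTTP port, grouped by state. A sketch, assuming InfluxDB is listening on the default port 8086:

  # a steadily growing count here would match the leaked-connection theory
  ss -tan | grep ':8086' | awk '{print $1}' | sort | uniq -c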

e-dard (Contributor) commented Apr 26, 2016

@dvenza I think it could be the same issue. Do you see lines in your InfluxDB log along the lines of:

[query] 2016/04/19 03:50:30 CREATE DATABASE IF NOT EXISTS DB_USED_BY_TELEGRAF
[http] 2016/04/19 03:50:30 10.17.128.180 - - [19/Apr/2016:03:50:30 +0300] GET /query?db=&q=CREATE+DATABASE+IF+NOT+EXISTS+%22DB_USED_BY_TELEGRAF%22 HTTP/1.1 200 40 - InfluxDBClient b6ef9665-05c8-11e6-83d2-000000000000 730.248µs

??
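
A quick way to check from the shell (a sketch, assuming the default rpm log location of /var/log/influxdb/influxd.log):

  # count how often a client has tried to (re)create the database
  grep -c 'CREATE DATABASE IF NOT EXISTS' /var/log/influxdb/influxd.log

A large and growing count would point at clients retrying after failed writes.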

e-dard (Contributor) commented Apr 26, 2016

@avinocur Can you confirm if you have a significant number of write errors in your Java client? This issue looks to be caused by a TCP connection leak due to write errors from Telegraf, but other clients could probably cause the same issue.

dvenza commented Apr 26, 2016

Yes, I see those lines. I will try to understand what is failing on the Telegraf side. When will 0.13 be released?

e-dard (Contributor) commented Apr 26, 2016

@dvenza Likely the first week of May. Of course, fixing whatever is causing your Telegraf writes to fail will resolve this issue before then 😄

@avinocur

@e-dard Surprisingly, yes! We're performing 3 retries and were monitoring errors, and it seems to be retrying a lot due to 500 errors from the server. So it's most likely a related issue.
On that matter, how would you check on the InfluxDB server for the root cause of these errors? Or are there any common reasons for a server to reject writes? Load on the server is pretty low, even during the execution of the continuous queries, so I'm a little confused about where to look next...

Thanks in advance!

e-dard (Contributor) commented Apr 26, 2016

@avinocur there are no common reasons for rejecting writes. It sounds like it may be worth opening a new issue with the relevant logs/data/setup.
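
In the meantime, a rough way to surface the failing requests in the influxd HTTP access log is to filter for 5xx responses. A sketch, assuming the default rpm log location of /var/log/influxdb/influxd.log and the [http] log format shown earlier in this thread:

  # list requests that came back with a 5xx status code
  grep '\[http\]' /var/log/influxdb/influxd.log | grep -E 'HTTP/1\.[01]"? 5[0-9]{2} '

Nearby log lines may show why the server returned an error, which should help when opening a new issue.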

@ngortheone

@e-dard
I have updated the Telegrafs from 0.12.0-1 to 0.12.1-1. No signs of a memory leak yet.
I will watch the system for a couple of days; if there is no memory leak, we will close this issue.

@ngortheone

It seems that after updating the Telegrafs, memory no longer leaks. Thanks @e-dard and everyone for the help. Closing the issue.

e-dard (Contributor) commented Apr 28, 2016

@ngortheone great! Glad to hear it's sorted out 👍
