[0.12.1-1] influxd Memory leak #6449
@ngortheone thanks for the report. A memory profile of the running influx server might be useful. You can generate one by running:
I can't remember if profiles get screwed up if the process crashes while they're being taken, so maybe run it for as much of a day as possible and then shut down cleanly to capture it.
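One common way to grab such a profile from a running Go service, offered here as a hedged sketch (it assumes the net/http/pprof endpoints are reachable on InfluxDB's HTTP API port, 8086 by default):

# capture a heap profile from the running influxd process
curl -o heap.pprof http://localhost:8086/debug/pprof/heap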
@e-dard
@ngortheone thanks for the profile. A couple more questions:
6 telegrafs are sending data to this instance. Here is a full copy of the input section that is used on all 6 telegrafs.
I hope this will give you some clues about the volume of data being received.
@ngortheone Looking at the memory profile, around 45GB has been allocated within Go's gzip package during the lifetime of this profile. If it's over roughly 24 hours, I don't think 6 telegraf instances with the above config would generate enough data. Are you sure nothing else is writing to the influx server? Can you give a rough idea of how much data is being written over a 24h period?
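For reference, totals like that can be read out of a heap profile with go tool pprof; a hedged example, assuming the profile was saved as heap.pprof and influxd is on the PATH (flag spellings vary slightly between Go releases):

# show cumulative allocations grouped by function; gzip shows up under compress/gzip
go tool pprof -alloc_space -top $(which influxd) heap.pprof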
@e-dard
At least nothing I know of.
I will collect 'df -h' stats 24 hours apart. How else can I get this info? Maybe a full influxdb log or a database dump?
Full influxdb logs would be useful, yes!
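Two rough ways to gauge write volume over 24 hours, offered as a hedged sketch (the log and data paths are the usual RPM-package defaults and may differ on your box):

# count write requests per log file, assuming the [httpd] access-log lines include "POST /write"
grep -c 'POST /write' /var/log/influxdb/influxd.log

# sample the size of the data directory now and again 24h later
du -sh /var/lib/influxdb/data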
Maybe #6425 is related?
@ngortheone thanks! Have you checked your Telegraf logs? It looks like there could be issues there with writes failing. The reason I think that might be the case is that telegraf appears to be trying to recreate a DB on influx. Of course, there is still some issue here with how Influx is handling that, but it could come down to this issue, which has been fixed. Telegraf logs would be useful.
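A hedged way to pull the relevant failures out of the Telegraf logs (the path and the exact wording of the error messages are assumptions; adjust the pattern to whatever your telegraf version logs on failed flushes):

grep -iE 'error|failed' /var/log/telegraf/telegraf.log | tail -n 50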
Correct. I will collect telegraf logs and let you know. |
@e-dard Also, is this a fix for #6425?
Hi, we're having a very similar issue, although we're not using telegraf, but posting custom metrics directly using the Java client. We're experiencing a continuous increase in memory consumption, until finally the database crashes with an out-of-memory error. This happened before on boxes with 8GB of RAM. We migrated to 16GB thinking that it might stabilize at some point, but that didn't happen. This is the memory use of the last couple of hours since we restarted the server:

We implemented a temporary solution with a cron that restarts the server when it crashes (it checks every minute whether influx is running), but that produces some inconsistencies between the two servers because of metrics posted during that minute of downtime. Is there any way to configure influxdb-relay to keep a cache and replay the writes performed during this short downtime? I saw that was available for UDP, but couldn't manage to configure it for the HTTP interface.

Please let me know if you think this is unrelated and should be on a separate issue. Regards,
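For illustration, a minimal sketch of the kind of every-minute watchdog described above, written as a /etc/cron.d entry (the process name and init-script name are assumptions):

# restart InfluxDB if the influxd process is gone
* * * * * root pgrep -x influxd > /dev/null || service influxdb restart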
I have the "out of memory" crash too. I have about 20 instances of telegraf sending data, plus some more metrics sent via HTTP. The crash does not happen every day; sometimes it does, sometimes two or three days pass with no problems.
Yep, all these write errors are likely to be the issue here. Due to #6425 and influxdata/telegraf#1061, connections are left hanging around, and there are buffers allocated for each of these connections. These allocations are what I'm seeing in the memory profile, and they're why you're OOMing. The quickest way to resolve this is to fix the problem that's causing Telegraf to fail to write to Influx. It looks like some sort of timeout; are you sure your Telegraf instances can reach the influx box? Fix that and this whole issue will go away. In terms of the bug in Influx, it was fixed in #6425 and will be in the upcoming 0.13 release. @ngortheone once you get your Telegraf instances working properly let me know if it works out OK, and we can close this issue. 👍
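One quick way to see whether connections really are piling up, as a hedged check (it assumes the default write port 8086):

# count established TCP connections on the InfluxDB HTTP port; a steadily growing
# number here is consistent with the connection leak described above
netstat -ant | grep ':8086' | grep -c ESTABLISHED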
@dvenza I think it could be the same issue. Do you see lines in your InfluxDB log along the lines of:
?
@avinocur Can you confirm if you have a significant number of write errors in your Java client? This issue looks to be caused by a TCP connection leak due to write errors from Telegraf, but other clients could probably cause the same issue.
Yes, I see those lines. I will try to understand what is failing on the telegraf side. When will 0.13 be released?
@dvenza Likely the first week of May. Of course, fixing the issue causing your Telegraf writes to fail will resolve this issue before then 😄
@e-dard Surprisingly, yes! We're performing 3 retries and we were monitoring errors, and it seems to be retrying a lot due to 500 errors on the server. So it's most likely a related issue. Thanks in advance!
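A hedged way to confirm how often the server is returning 500s on writes (it assumes the default [httpd] access-log lines, where the status code follows the request, and the usual RPM log path):

grep 'POST /write' /var/log/influxdb/influxd.log | grep -c ' 500 '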
@avinocur those are not common reasons to reject writes. Sounds like it may be worth opening a new issue with relevant logs/data/setup.
@e-dard
It seems that after updating the telegrafs, memory no longer leaks. Thanks @e-dard and everyone for the help. Closing the issue.
@ngortheone great! Glad to hear it's sorted out 👍
System info:
This is a CentOS 6.7 box with all the latest updates installed.
➜ ~ cat /etc/redhat-release
CentOS release 6.7 (Final)
➜ ~ uname -a
Linux myhost.com 2.6.32-573.12.1.el6.x86_64 #1 SMP Tue Dec 15 21:19:08 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
➜ ~ rpm -qa influxdb
influxdb-0.12.1-1.x86_64
➜ ~ cat /proc/meminfo
MemTotal: 15293128 kB
Steps to reproduce:
Expected behavior: no memory leak
Actual behavior: memory leak
Additional info:
dmesg log attached
influxdb log attached (non-relevant messages removed such as regular requests)
graph demonstrating the memory leak; the exact time it happens is around 3 am
dmesg.txt
influx_db_logs.txt
The problem is 100% reproducible and happens every night. I start my day by restarting the influxdb service :)
Let me know if I can supply more relevant info