I recently upgraded telegraf from 1.16.1 to the latest 1.20.0 and ran into some weird behavior.
No changes were made to the telegraf configuration, only the binary was updated.
I noticed some elasticsearch metrics were missing in influxdb, and when investigating I found these messages in the telegraf log:
Sep 22 01:57:01 es-8702e telegraf[9428]: 2021-09-22T01:57:01Z W! [outputs.influxdb] Metric buffer overflow; 1485 metrics have been dropped
Sep 22 01:58:00 es-8702e telegraf[9428]: 2021-09-22T01:58:00Z W! [outputs.influxdb] Metric buffer overflow; 1285 metrics have been dropped
Sep 22 01:59:00 es-8702e telegraf[9428]: 2021-09-22T01:59:00Z W! [outputs.influxdb] Metric buffer overflow; 1485 metrics have been dropped
I tried increasing the metric_buffer_limit setting several times, to 10000 and then 20000, but it didn't help much; the buffer overflow messages kept appearing in the logs (though with a somewhat longer delay after a telegraf restart). They only went away after I raised metric_buffer_limit to 40K.
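For reference, this is the knob in telegraf.conf I was changing - just a sketch with the rest of the agent section omitted (the default limit is 10000, and the setting can also be overridden per output plugin):

[agent]
  interval = "60s"
  metric_buffer_limit = 40000   # raised in steps: 10000 -> 20000 -> 40000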
To investigate, I enabled the internal plugin, and here is the picture I got:
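For anyone who wants to reproduce these graphs, enabling the internal input is roughly this (collect_memstats is what produces the memstats alloc_bytes series mentioned below):

[[inputs.internal]]
  collect_memstats = true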
The number of added metrics is stable at ~2.13K, while the number of written metrics is lower, ~1.25K, with periodic spikes of 15-30K.
buf_size is 0 most of the time; I think it should show the accumulated difference between added and written.
memstats alloc_bytes goes really high and correlates with the written spikes.
Step 2:
I downgraded telegraf back to 1.16.1 (blue line on the graph), and it really helped - equal values for added/written, and a much better memory alloc_bytes profile.
System info:
es-8702e is a master node in a quite large Elasticsearch cluster:
hundreds of ES data nodes, hundreds of indexes, and over 20K shards - the amount of data reported (interval = 60s) is huge, see the -test output size:
# /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -test > test
2021-09-22T19:20:23Z I! Starting Telegraf 1.16.1
# ls -lh test
-rw-r--r-- 1 root root 4.1M Sep 22 19:20 test
# wc -l test
2132 test
We run other, smaller clusters with the elasticsearch input plugin installed on their master nodes, and they don't have this problem on telegraf-1.20.0. Only this one node, which collects lots of ES metrics via the elasticsearch input plugin, is affected.
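For context, the input on the affected node is configured roughly like this - a sketch only; the server URL is a placeholder and the exact options in our telegraf.d may differ:

[[inputs.elasticsearch]]
  servers = ["http://localhost:9200"]   # placeholder
  local = false                # running on the master, collect cluster-wide stats
  cluster_health = true
  cluster_stats = true
  indices_include = ["_all"]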
Expected behavior:
telegraf-1.20.0 should not require metric_buffer_limit=40000 to keep working.
It works pretty well in version 1.16.1 (I haven't tested other versions).
If the internal metric written is lower than added, then I would expect buf_size to grow and report the actual buffer size, not 0.
Actual behavior:
Metrics are dropped with "Metric buffer overflow" warnings unless metric_buffer_limit is raised to 40K, and the internal buf_size stays at 0 even while written lags behind added.
Additional info:
Not reproducible with a low amount of data.
Possibly related to #9526.
Hi @anatolijd - we've actually gotten a lot of feedback about this, and I believe you're encountering the same issue as #9531.
We have a PR #9800 ready for testing; if you can, please test the build from the artifacts and comment on the PR if it resolves your problem - that will help us get it merged.