regression in 1.20.0 influxdb output #9802

Closed
anatolijd opened this issue Sep 22, 2021 · 2 comments · Fixed by #9800
Labels
area/elasticsearch, bug (unexpected problem or unintended behavior)

Comments

anatolijd commented Sep 22, 2021

hi,

I recently upgraded telegraf from 1.16.1 to the latest 1.20.0 and ran into some weird behavior.
No changes were made to the telegraf configuration; only the binary was updated.

I noticed that some elasticsearch metrics were missing in influxdb, and while investigating I found these messages in the telegraf log:

Sep 22 01:57:01 es-8702e telegraf[9428]: 2021-09-22T01:57:01Z W! [outputs.influxdb] Metric buffer overflow; 1485 metrics have been dropped
Sep 22 01:58:00 es-8702e telegraf[9428]: 2021-09-22T01:58:00Z W! [outputs.influxdb] Metric buffer overflow; 1285 metrics have been dropped
Sep 22 01:59:00 es-8702e telegraf[9428]: 2021-09-22T01:59:00Z W! [outputs.influxdb] Metric buffer overflow; 1485 metrics have been dropped

I tried increasing the metric_buffer_limit setting several times, to 10000 and then 20000, but it didn't help much: the buffer overflow messages kept reappearing in the logs (though with a somewhat longer delay after a telegraf restart). They went away only after I raised metric_buffer_limit to 40K.
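
For scale, some rough arithmetic (my own estimate, not measured): this node collects roughly 2.1K metrics per 60 s interval (see the -test output below), so the buffer limits I tried correspond to approximately:

metric_buffer_limit = 10000  ->  ~4-5 intervals of backlog before overflow
metric_buffer_limit = 20000  ->  ~9 intervals
metric_buffer_limit = 40000  ->  ~18-19 intervals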

To investigate further, I enabled the internal plugin (a minimal config sketch follows the list below), and here is what I got:

[Screenshot from 2021-09-22 22-13-41]

  • the number of added metrics is stable at ~2.13K, while the number of written metrics is lower, ~1.25K, with periodic spikes of 15-30K.
  • buf_size is 0 most of the time; I would expect it to show the accumulated difference between added and written.
  • memstats alloc_bytes goes really high and correlates with the write spikes.
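
For reference, enabling the internal plugin only needed a small config addition; a minimal sketch (if I read the plugin docs correctly, collect_memstats controls the memstats series, and the added/written/dropped/buf_size values come from the internal_write measurement):

[[inputs.internal]]
collect_memstats = true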

Step 2:
I downgraded telegraf back to 1.16.1 (the blue line on the graph), and that really helped: added/written values are equal again, and the memory alloc_bytes profile is much better.

Relevant telegraf.conf:

# cat /etc/telegraf/telegraf.conf
[agent]
flush_interval = "60s"
flush_jitter = "5s"
interval = "60s"
metric_batch_size = 400
metric_buffer_limit = 40000
round_interval = true

[[inputs.elasticsearch]]
cluster_health = true
cluster_stats = true
cluster_stats_only_from_master = true
http_timeout = "5s"
indices_include = ["_all"]
indices_level = "cluster"
insecure_skip_verify = true
local = false
namedrop = ["elasticsearch_indices_stats_primaries"]
node_stats = ["indices", "jvm", "thread_pool", "breaker"]
servers = ["https://monitor:password@localhost:9200"]

[[outputs.influxdb]]
database = "elasticsearch"
username = "user"
password = "password"
precision = "s"
skip_database_creation = true
timeout = "5s"
urls = ["http://influx.domain:8086"]

System info:

es-8702e is a master node in a fairly large Elasticsearch cluster: hundreds of ES data nodes, hundreds of indices, and over 20K shards. The amount of data reported per interval (interval = 60s) is huge; see the -test output size:

# /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -test > test
2021-09-22T19:20:23Z I! Starting Telegraf 1.16.1

# ls -lh test
-rw-r--r-- 1 root root 4.1M Sep 22 19:20 test

# wc -l test
2132 test

We run other, smaller clusters with the elasticsearch input plugin installed on their master nodes, and they don't have this problem on telegraf-1.20.0. Only this node, which collects a large amount of ES metrics via the elasticsearch input plugin, is affected.

Expected behavior:

telegraf-1.20.0 should not require metric_buffer_limit=40000 to keep working.
It works fine in version 1.16.1 (I have not tested other versions).

If the internal written value is lower than added, I would expect buf_size to grow and report the actual buffer size rather than 0.

Actual behavior:

Additional info:

Not reproducible with a low amount of data.
Possibly related to #9526.

@anatolijd anatolijd added the bug unexpected problem or unintended behavior label Sep 22, 2021
@anatolijd anatolijd changed the title regression in 1.20.1 influxdb output regression in 1.20.0 influxdb output Sep 22, 2021
anatolijd (Author) commented:

#9799 is a possible cause

sjwang90 (Contributor) commented:

Hi @anatolijd - We have actually gotten a lot of feedback about this, and I believe you're encountering the same issue as #9531.

We have PR #9800 ready for testing if you can try the build from its artifacts. Please comment on the PR if it resolves your problem; that will help us get it merged.

I'm going to close this issue as redundant.
