regression in 1.20.0 influxdb output #9802

Closed
anatolijd opened this issue Sep 22, 2021 · 2 comments · Fixed by #9800
Labels
area/elasticsearch, bug (unexpected problem or unintended behavior)

Comments

anatolijd commented Sep 22, 2021

hi,

I recently upgraded telegraf from 1.16.1 to the latest 1.20.0 and ran into some weird behavior.
No changes were made to the telegraf configuration; only the binary was updated.

I noticed that some elasticsearch metrics were missing in influxdb, and while investigating I found these messages in the telegraf log:

Sep 22 01:57:01 es-8702e telegraf[9428]: 2021-09-22T01:57:01Z W! [outputs.influxdb] Metric buffer overflow; 1485 metrics have been dropped
Sep 22 01:58:00 es-8702e telegraf[9428]: 2021-09-22T01:58:00Z W! [outputs.influxdb] Metric buffer overflow; 1285 metrics have been dropped
Sep 22 01:59:00 es-8702e telegraf[9428]: 2021-09-22T01:59:00Z W! [outputs.influxdb] Metric buffer overflow; 1485 metrics have been dropped

I tried increasing the metric_buffer_limit setting several times, to 10000 and then 20000, but it didn't help much: the buffer overflow messages kept reappearing in the logs (though with a somewhat longer delay after a telegraf restart). They went away only after I raised metric_buffer_limit to 40K.
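
For scale, some rough arithmetic (my own estimate, not measured): this node collects roughly 2.1K metrics per 60 s interval (see the -test output below), so the buffer limits I tried correspond to approximately:

metric_buffer_limit = 10000  ->  ~4-5 intervals of backlog before overflow
metric_buffer_limit = 20000  ->  ~9 intervals
metric_buffer_limit = 40000  ->  ~18-19 intervals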

To investigate further, I enabled the internal plugin (a minimal config sketch follows the list below), and here is what I got:

[Screenshot from 2021-09-22 22-13-41]

  • the number of added metrics is stable at ~2.13K, while the number of written metrics is lower, ~1.25K, with periodic spikes of 15-30K.
  • buf_size is 0 most of the time; I would expect it to show the accumulated difference between added and written.
  • memstats alloc_bytes goes really high and correlates with the write spikes.
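
For reference, enabling the internal plugin only needed a small config addition; a minimal sketch (if I read the plugin docs correctly, collect_memstats controls the memstats series, and the added/written/dropped/buf_size values come from the internal_write measurement):

[[inputs.internal]]
collect_memstats = true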

Step 2:
I downgraded telegraf back to 1.16.1 (the blue line on the graph), and that really helped: added/written values are equal again, and the memory alloc_bytes profile is much better.

Relevant telegraf.conf:

# cat /etc/telegraf/telegraf.conf
[agent]
flush_interval = "60s"
flush_jitter = "5s"
interval = "60s"
metric_batch_size = 400
metric_buffer_limit = 40000
round_interval = true

[[inputs.elasticsearch]]
cluster_health = true
cluster_stats = true
cluster_stats_only_from_master = true
http_timeout = "5s"
indices_include = ["_all"]
indices_level = "cluster"
insecure_skip_verify = true
local = false
namedrop = ["elasticsearch_indices_stats_primaries"]
node_stats = ["indices", "jvm", "thread_pool", "breaker"]
servers = ["https://monitor:password@localhost:9200"]

[[outputs.influxdb]]
database = "elasticsearch"
username = "user"
password = "password"
precision = "s"
skip_database_creation = true
timeout = "5s"
urls = ["http://influx.domain:8086"]

System info:

es-8702e is a master node in a fairly large Elasticsearch cluster: hundreds of ES data nodes, hundreds of indices, and over 20K shards. The amount of data reported per interval (interval = 60s) is huge; see the -test output size:

# /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -test > test
2021-09-22T19:20:23Z I! Starting Telegraf 1.16.1

# ls -lh test
-rw-r--r-- 1 root root 4.1M Sep 22 19:20 test

# wc -l test
2132 test

We run other, smaller clusters with the elasticsearch input plugin installed on their master nodes, and they don't have this problem on telegraf-1.20.0. Only this node, which collects a large amount of ES metrics via the elasticsearch input plugin, is affected.

Expected behavior:

telegraf-1.20.0 should not require metric_buffer_limit=40000 to keep working.
It works fine in version 1.16.1 (I have not tested other versions).

If the internal written value is lower than added, I would expect buf_size to grow and report the actual buffer size rather than 0.

Actual behavior:

Additional info:

Not reproducible with a low amount of data.
Possibly related to #9526.

@anatolijd anatolijd added the bug unexpected problem or unintended behavior label Sep 22, 2021
@anatolijd anatolijd changed the title regression in 1.20.1 influxdb output regression in 1.20.0 influxdb output Sep 22, 2021
anatolijd (Author) commented:

#9799 is a possible cause

sjwang90 (Contributor) commented:

Hi @anatolijd - We have actually gotten a lot of feedback about this, and I believe you're encountering the same issue as #9531.

We have PR #9800 ready for testing if you can try the build from its artifacts. Please comment on the PR if it resolves your problem; that will help us get it merged.

I'm going to close this issue as redundant.
