
Performance Regression Between 17.2 and 19.3 #9849

Closed
ihdavids opened this issue Oct 1, 2021 · 2 comments
Labels: area/elasticsearch, bug (unexpected problem or unintended behavior)

Comments


ihdavids commented Oct 1, 2021

Relevant telegraf.conf:

Some things have been removed for clarity.

[global_tags]
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 300
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "5s"
precision = ""
logfile = "/var/log/telegraf/telegraf.log"
hostname = ""
omit_hostname = false

###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

[[inputs.http_listener_v2]]
service_address = ":8086"
path = "/"
methods   = ["POST", "PUT"]
data_format = "influx"

[[inputs.internal]]

[[inputs.cpu]]
percpu = false
totalcpu = true
report_active = true
name_override = "internal_cpu"

###############################################################################
#                            PROCESSOR PLUGINS                                #
###############################################################################

# Uses a Go template to create a new tag
[[processors.template]]
order = 1
tag = "name_tag"
template = '{{ .Name }}'

[[processors.strings]]
order = 2
[[processors.strings.lowercase]]
tag = "name_tag"

[[processors.strings]]
order = 3
    [[processors.strings.trim_suffix]]
    field_key = "*"
    suffix  = "_sum"
    [[processors.strings.replace]]
    tag_key = "*"
    old   = "."
    new   = "_"
    [[processors.strings.replace]]
    field_key = "*"
    old     = "."
    new     = "_"

[[processors.converter]]
    [processors.converter.fields]
    tag = ["something"]

[[processors.rename]]
    [[processors.rename.replace]]
    tag = "host"
    dest = "telegraf_host"

[[processors.regex]]
    [[processors.regex.fields]]
    key       = "something"
    pattern   = "[^/]*/(.*)"
    replacement = "${1}"


###############################################################################
#                            AGGREGATOR PLUGINS                               #
###############################################################################

[[aggregators.basicstats]]
period = "120s"
drop_original = true
stats = ["sum"]
fielddrop = ["acoupleoffields"]
    [aggregators.basicstats.tagpass]
    type = ["anotherfield"]

[[aggregators.basicstats]]
period = "15s"
stats  = ["diff"]
namepass = ["internal_write"]

###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

[[outputs.influxdb]]
urls = [""]
database = "default" # required    
retention_policy = "one_day"
write_consistency = "any"
precision       = "s"
namepass = [
  "SomeNames",
]
tagexclude = ["name_tag", "othertags"]

[[outputs.elasticsearch]]
urls     = ["an_es_cluster"]
index_name = "{{name_tag}}-%Y.%m.%d"
flush_interval = "15s"
force_document_id = true
namedrop = ["ACoupleOfNames"]

System info:

Telegraf 19.3 vs Telegraf 17.2 or 17.3
Running on a relatively vanilla Ubuntu 20.04
Same machine for both configurations, built with the same version of Go.

Steps to reproduce:

  1. Build Telegraf 19.3 and run it with the same Elasticsearch and InfluxDB outputs; the output queue lengths are roughly double and our drop rate increases by about 35% (a sketch for observing these numbers follows below the list).
  2. The same behaviour has been seen on 19.2, 19.1, and 18.3, but not on 17.3.
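
To make the comparison concrete, the queue and drop numbers come from Telegraf's own internal_write measurement, which the [[inputs.internal]] plugin in the config above collects. Below is a minimal sketch of isolating those fields for a side-by-side comparison, assuming the standard internal_write field names; the outputs.file block is a debug-only illustration, not part of the production config above:

# Debug-only output (illustrative): route just the agent's own write
# statistics to stdout so buffer_size and metrics_dropped can be compared
# between the two Telegraf versions.
[[outputs.file]]
  files = ["stdout"]
  namepass = ["internal_write"]
  fieldpass = ["buffer_size", "metrics_dropped", "metrics_written"]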

Expected behavior:

Equivalent throughput.

Actual behavior:

Substantially lower throughput and a much higher rate of dropped metrics.

Additional info:

Here is an example image showing 17.2 vs 19.3. The write buffer size is substantially higher:
[screenshot: write buffer size, 17.2 vs 19.3]

As a result our drop rate is much higher. Here is a comparison of the drop rate from the older node vs the newer node:
[screenshots: drop rate, older node vs newer node]

I have slowly been working my way back through Telegraf versions. The behaviour is not visible on 17.3 or 17.2 but does appear on 18.3; I have yet to work back through the 18.x versions older than 18.3. Here is 18.3:

[screenshot: the same behaviour on 18.3]

For now this means that the highest version we can upgrade to is 17.3, which is really disappointing.

ihdavids added the bug label on Oct 1, 2021
powersj (Contributor) commented Oct 1, 2021

Hi! Thanks for taking the time to report this with lots of detail. We believe we just fixed this with #9800. If you want, you might try one of our nightly builds or wait until v1.20.1, which should land next week.

I am going to go ahead and close this since it matches what that issue found and fixed, but if you are able to try v1.20.1 after next week and still run into issues, we would love to know.

Thanks!

powersj closed this as completed on Oct 1, 2021
ihdavids (Author) commented Oct 1, 2021

I will certainly try this with 1.20.1! Thank you for the quick response.
It was very disheartening to realize we were blocked from upgrading and can't get all the latest telegraf goodness!
