Kafka output performance is worse than expected #2825

Closed
keyboardfann opened this issue May 18, 2017 · 18 comments
Labels: area/kafka, performance

@keyboardfann commented May 18, 2017

Dear all,
I have a workload of about 100k metrics/sec, but when testing the statsd input with the kafka output I only get roughly 600k metrics/min.
Has anyone tested Telegraf performance at this scale, or can anyone give me some ideas?

Machine types:
1 Telegraf server: 2 CPU cores, 4 GB memory
3 Kafka servers: 2 CPU cores, 8 GB memory each

Software Version:
Telegraf 1.3.0
Kafka 0.10.2

telegraf config:

[global_tags]
[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 600000
  metric_buffer_limit = 10000000
  collection_jitter = "0s"
  flush_interval = "60s"
  flush_jitter = "0s"
  precision = ""
  debug = true
  quiet = false
  logfile = "/var/log/telegraf/telegraf.log"
  hostname = ""
  omit_hostname = true


[[outputs.kafka]]
  brokers = ["10.62.4.160:9092","10.62.4.161:9092","10.62.4.162:9092"]
  topic = "telegraf"
  routing_tag = "host"
  compression_codec = 0
  required_acks = 0
  max_retry = 3
  data_format = "graphite"

[[inputs.statsd]]
  service_address = "10.62.4.159:12004"

  delete_gauges = true
  delete_counters = true
  delete_sets = true
  delete_timings = true
  percentiles = [90]
  metric_separator = "."
  parse_data_dog_tags = false
  allowed_pending_messages = 10000000
  percentile_limit = 1000

Test Script:
https://github.com/octo/statsd-tg

statsd-tg -d 10.62.4.159 -D 12004 -T 2 -s 0 -c 1000000 -t 0 -g 0

Test Result:

2017-05-18T03:40:00Z D! Output [kafka] buffer fullness: 603796 / 10000000 metrics. 
2017-05-18T03:40:59Z D! Output [kafka] wrote batch of 603796 metrics in 59.374510298s
@keyboardfann (Author) commented May 18, 2017

I tried the socket_writer output plugin over UDP and TCP: UDP reaches about 100k metrics/s and TCP about 50k metrics/s, but the kafka output plugin only manages about 10k metrics/s.

Telegraf UDP config:

[global_tags]
[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 600000
  metric_buffer_limit = 10000000
  collection_jitter = "0s"
  flush_interval = "60s"
  flush_jitter = "0s"
  precision = ""
  debug = true
  quiet = false
  logfile = "/var/log/telegraf/telegraf.log"
  hostname = ""
  omit_hostname = true

  
[[outputs.socket_writer]]
  address = "udp://127.0.0.1:8094"
  data_format = "graphite"


[[inputs.statsd]]
  service_address = "10.62.4.162:12004"
  delete_gauges = true
  delete_counters = true
  delete_sets = true
  delete_timings = true
  percentiles = [90]
  metric_separator = "."
  parse_data_dog_tags = false
  allowed_pending_messages = 10000000
  percentile_limit = 1000

Result:

2017-05-18T06:50:08Z D! Output [socket_writer] wrote batch of 470028 metrics in 7.928522448s
2017-05-18T06:50:12Z D! Output [socket_writer] wrote batch of 600000 metrics in 7.39794513s
2017-05-18T06:51:00Z D! Output [socket_writer] buffer fullness: 418543 / 10000000 metrics. 
2017-05-18T06:51:05Z D! Output [socket_writer] wrote batch of 418543 metrics in 5.534527361s
2017-05-18T06:51:10Z D! Output [socket_writer] wrote batch of 600000 metrics in 7.842004589s

Telegraf TCP Config:

...
....
[[outputs.socket_writer]]
  address = "tcp://127.0.0.1:8094"
  data_format = "graphite"
...
...

Result:

2017-05-18T07:01:15Z D! Output [socket_writer] wrote batch of 478660 metrics in 15.150335814s
2017-05-18T07:01:17Z D! Output [socket_writer] wrote batch of 600000 metrics in 14.857339777s
2017-05-18T07:02:00Z D! Output [socket_writer] buffer fullness: 399438 / 10000000 metrics. 
2017-05-18T07:02:04Z D! Output [socket_writer] wrote batch of 399438 metrics in 4.459943777s

@danielnelson (Contributor) commented:

You could try enabling the compression codec.

We try to reserve GitHub Issues for actionable bug reports or feature requests. Please ask general questions at the InfluxData Community site.
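
For reference, compression on the kafka output is controlled by the compression_codec option; a minimal sketch, assuming the codec numbering used by Telegraf's kafka output (0 = none, 1 = gzip, 2 = snappy):

[[outputs.kafka]]
  brokers = ["10.62.4.160:9092","10.62.4.161:9092","10.62.4.162:9092"]
  topic = "telegraf"
  # assumed mapping: 0 = none (current setting), 1 = gzip, 2 = snappy
  compression_codec = 1
  required_acks = 0
  max_retry = 3
  data_format = "graphite"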

@keyboardfann (Author) commented:

@danielnelson Thank you, I have submitted a ticket to the InfluxData Community. I also tried enabling compression, but then collection exceeds 1 minute:

2017-05-19T02:01:00Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1m0s)
2017-05-19T02:02:00Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1m0s)

@danielnelson (Contributor) commented:

I would also try setting these values back to their defaults; I believe this is why the statsd input is reporting that error. What happens is that we buffer so many metrics that flushing takes too long, and during this time the internal metric channel fills up and blocks the inputs. If you flush more frequently with less data, the input is blocked for a shorter period.

  metric_batch_size = 600000
  metric_buffer_limit = 10000000

Defaults are 1000 and 10000.
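
A minimal sketch of the agent section with these two settings back at their defaults (other options as in the original config):

[agent]
  interval = "60s"
  flush_interval = "60s"
  # defaults, per the suggestion above
  metric_batch_size = 1000
  metric_buffer_limit = 10000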

@keyboardfann (Author) commented May 23, 2017

Dear @danielnelson,
Thank you for the suggestion. I changed metric_batch_size and metric_buffer_limit back to their default values and enabled compression, but the performance is worse than with compression disabled (3x seconds to handle 1000 metrics), and the collection-interval errors still occur. With compression disabled it is still slow: a 1000-metric batch takes anywhere from 6x ms to 2xx ms.

telegraf.conf (compression enabled)

[global_tags]
[agent]
  interval = "60s"
  round_interval = true

  metric_batch_size = 1000

  metric_buffer_limit = 10000

  collection_jitter = "0s"

  flush_interval = "60s"
  flush_jitter = "0s"

  precision = ""

  debug = true
  quiet = false
  logfile = "/var/log/telegraf/telegraf.log"

  hostname = ""
  omit_hostname = true

[[outputs.kafka]]
  brokers = ["10.62.4.160:9092","10.62.4.161:9092","10.62.4.162:9092"]
  topic = "telegraf"
  routing_tag = "host"
  compression_codec = 2
  required_acks = 0
  max_retry = 3
  data_format = "graphite"


[[inputs.statsd]]
  service_address = "10.62.4.159:12004"

  delete_gauges = true
  delete_counters = true
  delete_sets = true
  delete_timings = true
  percentiles = [90]
  metric_separator = "."
  parse_data_dog_tags = false
  allowed_pending_messages = 100000
  percentile_limit = 1000

telegraf.log (compression enabled)

2017-05-23T01:41:39Z I! Starting Telegraf (version 1.3.0)
2017-05-23T01:41:39Z I! Loaded outputs: kafka
2017-05-23T01:41:39Z I! Loaded inputs: inputs.statsd
2017-05-23T01:41:39Z I! Tags enabled: 
2017-05-23T01:41:39Z I! Agent Config: Interval:1m0s, Quiet:false, Hostname:"", Flush Interval:1m0s 
2017-05-23T01:41:39Z I! Started the statsd service on 10.62.4.159:12004
2017-05-23T01:41:39Z I! Statsd listener listening on:  10.62.4.159:12004

2017-05-23T01:42:34Z D! Output [kafka] wrote batch of 1000 metrics in 34.146012646s
2017-05-23T01:43:00Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1m0s)
2017-05-23T01:43:10Z D! Output [kafka] wrote batch of 1000 metrics in 35.728663699s
2017-05-23T01:43:10Z D! Output [kafka] buffer fullness: 101 / 10000 metrics. 
2017-05-23T01:43:14Z D! Output [kafka] wrote batch of 101 metrics in 4.321134247s
2017-05-23T01:43:45Z D! Output [kafka] wrote batch of 1000 metrics in 34.849147541s
2017-05-23T01:44:00Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1m0s)
2017-05-23T01:44:18Z D! Output [kafka] wrote batch of 1000 metrics in 33.740874524s
2017-05-23T01:44:18Z D! Output [kafka] buffer fullness: 101 / 10000 metrics. 
2017-05-23T01:44:22Z D! Output [kafka] wrote batch of 101 metrics in 4.000244478s

telegraf.log (compression disabled)

2017-05-23T01:49:16Z D! Output [kafka] wrote batch of 101 metrics in 3.274678469s
2017-05-23T01:49:39Z D! Attempting connection to output: kafka
2017-05-23T01:49:39Z D! Successfully connected to output: kafka
2017-05-23T01:49:39Z I! Starting Telegraf (version 1.3.0)
2017-05-23T01:49:39Z I! Loaded outputs: kafka
2017-05-23T01:49:39Z I! Loaded inputs: inputs.statsd
2017-05-23T01:49:39Z I! Tags enabled: 
2017-05-23T01:49:39Z I! Agent Config: Interval:1m0s, Quiet:false, Hostname:"", Flush Interval:1m0s 
2017-05-23T01:49:39Z I! Started the statsd service on 10.62.4.159:12004
2017-05-23T01:49:39Z I! Statsd listener listening on:  10.62.4.159:12004
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 60.63796ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 59.632735ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 58.111955ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 68.970922ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 69.698945ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 58.801752ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 77.740043ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 69.050939ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 68.511274ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 72.308025ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 58.838106ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 62.046881ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 61.875946ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 63.217171ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 44.602156ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 52.071066ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 43.949318ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 45.746334ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 43.105172ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 47.755334ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 48.779173ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 44.056094ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 48.682808ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 47.640613ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 45.628189ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 54.141609ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 49.613704ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 45.344784ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 48.739167ms
2017-05-23T01:50:02Z D! Output [kafka] wrote batch of 1000 metrics in 47.372537ms
2017-05-23T01:50:02Z D! Output [kafka] wrote batch of 1000 metrics in 41.691296ms
2017-05-23T01:50:02Z D! Output [kafka] wrote batch of 1000 metrics in 116.082277ms
2017-05-23T01:50:02Z D! Output [kafka] wrote batch of 1000 metrics in 153.724614ms
2017-05-23T01:50:02Z D! Output [kafka] wrote batch of 1000 metrics in 150.281489ms
2017-05-23T01:50:02Z D! Output [kafka] wrote batch of 1000 metrics in 200.805335ms
2017-05-23T01:50:02Z D! Output [kafka] wrote batch of 1000 metrics in 188.87451ms

@danielnelson changed the title from "Telegraf performance use statsd input plugin & kafka output plugin" to "Kafka output performance is worse than expected" on May 23, 2017
@danielnelson reopened this on May 23, 2017
@danielnelson added the area/kafka and performance labels on Jun 8, 2017
@dennisjbr commented:

We are seeing the same issue with Telegraf 1.3.1. We did testing with a simple PHP script and metrics are posted immediately, but with the Telegraf plugin the performance is abysmal.
Taking 6 minutes to post?? Debug output:

D! Output [kafka] wrote batch of 1000 metrics in 6m40.101144501s

@danielnelson (Contributor) commented:

Wow, 6m is really bad. Do you think you could capture a trace?

If you run telegraf with --pprof-addr :6060 you can get a trace with curl 127.0.0.1:6060/debug/pprof/trace?seconds=60 > trace.out. Run this while it is outputting metrics and upload the trace.out file. You can also look at the trace file yourself with go tool trace -http=':7070' trace.out

@dennisjbr commented:

Hey Daniel, here is the file. I did it for 5 minutes though, I hope that is OK.
trace.out.zip

Also, here is some debug output during the trace. As you can see, it finally wrote to kafka at the 2-minute mark:

2017/06/15 19:18:26 I! Using config file: /etc/telegraf/telegraf.conf
2017/06/15 19:18:26 I! Starting pprof HTTP server at: http://localhost:6060/debug/pprof
2017-06-16T02:18:26Z D! Attempting connection to output: influxdb
2017-06-16T02:18:26Z D! Successfully connected to output: influxdb
2017-06-16T02:18:26Z D! Attempting connection to output: kafka
2017-06-16T02:18:26Z D! Successfully connected to output: kafka
2017-06-16T02:18:26Z I! Starting Telegraf (version 1.3.1)
2017-06-16T02:18:26Z I! Loaded outputs: influxdb kafka
2017-06-16T02:18:26Z I! Loaded inputs: inputs.kernel inputs.mem inputs.processes inputs.system inputs.influxdb inputs.net inputs.cpu inputs.diskio inputs.internal inputs.kapacitor inputs.ntpq inputs.disk inputs.swap
2017-06-16T02:18:26Z I! Tags enabled: host=data-1 hv=test-hv-value
2017-06-16T02:18:26Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"data-1", Flush Interval:10s
2017-06-16T02:18:41Z D! Output [influxdb] buffer fullness: 255 / 10000 metrics.
2017-06-16T02:18:41Z D! Output [kafka] buffer fullness: 255 / 10000 metrics.
2017-06-16T02:18:41Z D! Output [influxdb] wrote batch of 255 metrics in 13.89314ms
2017-06-16T02:18:50Z W! Skipping a scheduled flush because there is already a flush ongoing.
2017-06-16T02:19:00Z W! Skipping a scheduled flush because there is already a flush ongoing.
2017-06-16T02:19:10Z W! Skipping a scheduled flush because there is already a flush ongoing.
2017-06-16T02:19:20Z W! Skipping a scheduled flush because there is already a flush ongoing.
2017-06-16T02:19:30Z W! Skipping a scheduled flush because there is already a flush ongoing.
2017-06-16T02:19:40Z W! Skipping a scheduled flush because there is already a flush ongoing.
2017-06-16T02:19:50Z W! Skipping a scheduled flush because there is already a flush ongoing.
2017-06-16T02:20:00Z D! Output [influxdb] wrote batch of 1000 metrics in 22.074081ms
2017-06-16T02:20:00Z W! Skipping a scheduled flush because there is already a flush ongoing.
2017-06-16T02:20:30Z E! Error in plugin [inputs.diskio]: took longer to collect than collection interval (10s)
2017-06-16T02:20:30Z E! Error in plugin [inputs.processes]: took longer to collect than collection interval (10s)
2017-06-16T02:20:30Z E! Error in plugin [inputs.influxdb]: took longer to collect than collection interval (10s)
2017-06-16T02:20:30Z E! Error in plugin [inputs.system]: took longer to collect than collection interval (10s)
2017-06-16T02:20:30Z E! Error in plugin [inputs.kapacitor]: took longer to collect than collection interval (10s)
2017-06-16T02:20:30Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2017-06-16T02:20:30Z E! Error in plugin [inputs.ntpq]: took longer to collect than collection interval (10s)
2017-06-16T02:20:30Z E! Error in plugin [inputs.disk]: took longer to collect than collection interval (10s)
2017-06-16T02:20:30Z E! Error in plugin [inputs.internal]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.diskio]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.swap]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.processes]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.influxdb]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.kernel]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.net]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.system]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.kapacitor]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.ntpq]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.disk]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.internal]: took longer to collect than collection interval (10s)
2017-06-16T02:20:41Z D! Output [kafka] wrote batch of 255 metrics in 2m0.190538615s

@dennisjbr commented:

Sorry, I should have tagged you, @danielnelson.

@danielnelson (Contributor) commented:

Looking through the trace, all the time is being spent in the sarama library. We did update this library recently though; do you think you could try with the latest nightly?

@dennisjbr commented:

@danielnelson Sure, I can test it out sometime this weekend. Let me know if I should reach out via email. Mention my name to Colm and he can get you the deets.

@danielnelson (Contributor) commented:

@dennisjbr Can you add your kafka output's configuration?

@dennisjbr commented Jun 18, 2017

@danielnelson FYI I replaced the IP address with y.y.y.y, replaced the topic with "topic", and removed commented-out lines for brevity.

[[outputs.kafka]]
  brokers = ["y.y.y.y:9092"]
  topic = "topic"
  max_retry = 3
  data_format = "influx"

@dennisjbr commented:

@danielnelson No workie. It still hangs on Kafka, which hangs up the whole agent and stops data from being sent to all outputs. That third-party library is looking questionable ;-) Thanks in advance.

@danielnelson (Contributor) commented:

Performance should be much improved in 1.8 with the changes made in #4491. If possible, can you help test by running the nightly builds?

@hxinhan commented Sep 10, 2018

@danielnelson I have the same problem running release 1.7.1, while the performance is indeed much better in 1.8 with the changes from #4491! When will the 1.8 release be rolled out?

@glinton (Contributor) commented Sep 10, 2018

@HansonHH 1.8 should be released by the end of the month with an RC coming before that.

@hxinhan commented Sep 10, 2018

@glinton Thanks for the info! Looking forward to 1.8 with the latest and greatest!
