Kafka output performance is worse than expected #2825

Closed
keyboardfann opened this issue May 18, 2017 · 18 comments
Labels: area/kafka, performance

@keyboardfann commented May 18, 2017

Dear all,
I have a workload of about 100k metrics/sec, but when testing the statsd input with the kafka output I only get roughly 600k metrics/min.
Has anyone tested Telegraf performance at this scale, or can anyone give me some ideas?

Machine types:
1 Telegraf server: 2 CPU cores, 4 GB memory
3 Kafka servers: 2 CPU cores, 8 GB memory each

Software Version:
Telegraf 1.3.0
Kafka 0.10.2

telegraf config:

[global_tags]
[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 600000
  metric_buffer_limit = 10000000
  collection_jitter = "0s"
  flush_interval = "60s"
  flush_jitter = "0s"
  precision = ""
  debug = true
  quiet = false
  logfile = "/var/log/telegraf/telegraf.log"
  hostname = ""
  omit_hostname = true


[[outputs.kafka]]
  brokers = ["10.62.4.160:9092","10.62.4.161:9092","10.62.4.162:9092"]
  topic = "telegraf"
  routing_tag = "host"
  compression_codec = 0
  required_acks = 0
  max_retry = 3
  data_format = "graphite"

[[inputs.statsd]]
  service_address = "10.62.4.159:12004"

  delete_gauges = true
  delete_counters = true
  delete_sets = true
  delete_timings = true
  percentiles = [90]
  metric_separator = "."
  parse_data_dog_tags = false
  allowed_pending_messages = 10000000
  percentile_limit = 1000

Test Script:
https://github.com/octo/statsd-tg

statsd-tg -d 10.62.4.159 -D 12004 -T 2 -s 0 -c 1000000 -t 0 -g 0

Test Result:

2017-05-18T03:40:00Z D! Output [kafka] buffer fullness: 603796 / 10000000 metrics. 
2017-05-18T03:40:59Z D! Output [kafka] wrote batch of 603796 metrics in 59.374510298s
@keyboardfann (Author) commented May 18, 2017

I tried the socket_writer output plugin over UDP and TCP: UDP reaches about 100k metrics/s and TCP about 50k metrics/s, but the kafka output plugin only manages about 10k metrics/s.

Telegraf UDP config:

[global_tags]
[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 600000
  metric_buffer_limit = 10000000
  collection_jitter = "0s"
  flush_interval = "60s"
  flush_jitter = "0s"
  precision = ""
  debug = true
  quiet = false
  logfile = "/var/log/telegraf/telegraf.log"
  hostname = ""
  omit_hostname = true

  
[[outputs.socket_writer]]
  address = "udp://127.0.0.1:8094"
  data_format = "graphite"


[[inputs.statsd]]
  service_address = "10.62.4.162:12004"
  delete_gauges = true
  delete_counters = true
  delete_sets = true
  delete_timings = true
  percentiles = [90]
  metric_separator = "."
  parse_data_dog_tags = false
  allowed_pending_messages = 10000000
  percentile_limit = 1000

Result:

2017-05-18T06:50:08Z D! Output [socket_writer] wrote batch of 470028 metrics in 7.928522448s
2017-05-18T06:50:12Z D! Output [socket_writer] wrote batch of 600000 metrics in 7.39794513s
2017-05-18T06:51:00Z D! Output [socket_writer] buffer fullness: 418543 / 10000000 metrics. 
2017-05-18T06:51:05Z D! Output [socket_writer] wrote batch of 418543 metrics in 5.534527361s
2017-05-18T06:51:10Z D! Output [socket_writer] wrote batch of 600000 metrics in 7.842004589s

Telegraf TCP Config:

...
....
[[outputs.socket_writer]]
  address = "tcp://127.0.0.1:8094"
  data_format = "graphite"
...
...

Result:

2017-05-18T07:01:15Z D! Output [socket_writer] wrote batch of 478660 metrics in 15.150335814s
2017-05-18T07:01:17Z D! Output [socket_writer] wrote batch of 600000 metrics in 14.857339777s
2017-05-18T07:02:00Z D! Output [socket_writer] buffer fullness: 399438 / 10000000 metrics. 
2017-05-18T07:02:04Z D! Output [socket_writer] wrote batch of 399438 metrics in 4.459943777s

@danielnelson (Contributor) commented:

You could try enabling the compression codec.

We try to reserve GitHub Issues for actionable bug reports or feature requests. Please ask general questions at the InfluxData Community site.
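
For reference, compression on the kafka output is controlled by the compression_codec option; a minimal sketch, assuming the codec numbering used by Telegraf's kafka output (0 = none, 1 = gzip, 2 = snappy):

[[outputs.kafka]]
  brokers = ["10.62.4.160:9092","10.62.4.161:9092","10.62.4.162:9092"]
  topic = "telegraf"
  # assumed mapping: 0 = none (current setting), 1 = gzip, 2 = snappy
  compression_codec = 1
  required_acks = 0
  max_retry = 3
  data_format = "graphite"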

@keyboardfann (Author) commented:

@danielnelson Thank you, I have submitted a ticket to the InfluxData Community. I also tried enabling compression, but then collection exceeds 1 minute:

2017-05-19T02:01:00Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1m0s)
2017-05-19T02:02:00Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1m0s)

@danielnelson (Contributor) commented:

I would also try setting these values back to their defaults; I believe this is why the statsd input is reporting that error. What happens is that we buffer so many metrics that flushing takes too long, and during this time the internal metric channel fills up and blocks the inputs. If you flush more frequently with less data, the input is blocked for a shorter period.

  metric_batch_size = 600000
  metric_buffer_limit = 10000000

Defaults are 1000 and 10000.
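
A minimal sketch of the agent section with these two settings back at their defaults (other options as in the original config):

[agent]
  interval = "60s"
  flush_interval = "60s"
  # defaults, per the suggestion above
  metric_batch_size = 1000
  metric_buffer_limit = 10000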

@keyboardfann (Author) commented May 23, 2017

Dear @danielnelson,
Thank you for the suggestion. I changed metric_batch_size and metric_buffer_limit back to their default values and enabled compression, but the performance is worse than with compression disabled (3x seconds to handle 1000 metrics), and the collection-interval errors still occur. With compression disabled it is still slow: a 1000-metric batch takes anywhere from 6x ms to 2xx ms.

telegraf.conf (compression enabled)

[global_tags]
[agent]
  interval = "60s"
  round_interval = true

  metric_batch_size = 1000

  metric_buffer_limit = 10000

  collection_jitter = "0s"

  flush_interval = "60s"
  flush_jitter = "0s"

  precision = ""

  debug = true
  quiet = false
  logfile = "/var/log/telegraf/telegraf.log"

  hostname = ""
  omit_hostname = true

[[outputs.kafka]]
  brokers = ["10.62.4.160:9092","10.62.4.161:9092","10.62.4.162:9092"]
  topic = "telegraf"
  routing_tag = "host"
  compression_codec = 2
  required_acks = 0
  max_retry = 3
  data_format = "graphite"


[[inputs.statsd]]
  service_address = "10.62.4.159:12004"

  delete_gauges = true
  delete_counters = true
  delete_sets = true
  delete_timings = true
  percentiles = [90]
  metric_separator = "."
  parse_data_dog_tags = false
  allowed_pending_messages = 100000
  percentile_limit = 1000

telegraf.log (compression enabled)

2017-05-23T01:41:39Z I! Starting Telegraf (version 1.3.0)
2017-05-23T01:41:39Z I! Loaded outputs: kafka
2017-05-23T01:41:39Z I! Loaded inputs: inputs.statsd
2017-05-23T01:41:39Z I! Tags enabled: 
2017-05-23T01:41:39Z I! Agent Config: Interval:1m0s, Quiet:false, Hostname:"", Flush Interval:1m0s 
2017-05-23T01:41:39Z I! Started the statsd service on 10.62.4.159:12004
2017-05-23T01:41:39Z I! Statsd listener listening on:  10.62.4.159:12004

2017-05-23T01:42:34Z D! Output [kafka] wrote batch of 1000 metrics in 34.146012646s
2017-05-23T01:43:00Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1m0s)
2017-05-23T01:43:10Z D! Output [kafka] wrote batch of 1000 metrics in 35.728663699s
2017-05-23T01:43:10Z D! Output [kafka] buffer fullness: 101 / 10000 metrics. 
2017-05-23T01:43:14Z D! Output [kafka] wrote batch of 101 metrics in 4.321134247s
2017-05-23T01:43:45Z D! Output [kafka] wrote batch of 1000 metrics in 34.849147541s
2017-05-23T01:44:00Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1m0s)
2017-05-23T01:44:18Z D! Output [kafka] wrote batch of 1000 metrics in 33.740874524s
2017-05-23T01:44:18Z D! Output [kafka] buffer fullness: 101 / 10000 metrics. 
2017-05-23T01:44:22Z D! Output [kafka] wrote batch of 101 metrics in 4.000244478s

telegraf.log (compression disabled)

2017-05-23T01:49:16Z D! Output [kafka] wrote batch of 101 metrics in 3.274678469s
2017-05-23T01:49:39Z D! Attempting connection to output: kafka
2017-05-23T01:49:39Z D! Successfully connected to output: kafka
2017-05-23T01:49:39Z I! Starting Telegraf (version 1.3.0)
2017-05-23T01:49:39Z I! Loaded outputs: kafka
2017-05-23T01:49:39Z I! Loaded inputs: inputs.statsd
2017-05-23T01:49:39Z I! Tags enabled: 
2017-05-23T01:49:39Z I! Agent Config: Interval:1m0s, Quiet:false, Hostname:"", Flush Interval:1m0s 
2017-05-23T01:49:39Z I! Started the statsd service on 10.62.4.159:12004
2017-05-23T01:49:39Z I! Statsd listener listening on:  10.62.4.159:12004
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 60.63796ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 59.632735ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 58.111955ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 68.970922ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 69.698945ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 58.801752ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 77.740043ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 69.050939ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 68.511274ms
2017-05-23T01:50:00Z D! Output [kafka] wrote batch of 1000 metrics in 72.308025ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 58.838106ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 62.046881ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 61.875946ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 63.217171ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 44.602156ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 52.071066ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 43.949318ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 45.746334ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 43.105172ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 47.755334ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 48.779173ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 44.056094ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 48.682808ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 47.640613ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 45.628189ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 54.141609ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 49.613704ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 45.344784ms
2017-05-23T01:50:01Z D! Output [kafka] wrote batch of 1000 metrics in 48.739167ms
2017-05-23T01:50:02Z D! Output [kafka] wrote batch of 1000 metrics in 47.372537ms
2017-05-23T01:50:02Z D! Output [kafka] wrote batch of 1000 metrics in 41.691296ms
2017-05-23T01:50:02Z D! Output [kafka] wrote batch of 1000 metrics in 116.082277ms
2017-05-23T01:50:02Z D! Output [kafka] wrote batch of 1000 metrics in 153.724614ms
2017-05-23T01:50:02Z D! Output [kafka] wrote batch of 1000 metrics in 150.281489ms
2017-05-23T01:50:02Z D! Output [kafka] wrote batch of 1000 metrics in 200.805335ms
2017-05-23T01:50:02Z D! Output [kafka] wrote batch of 1000 metrics in 188.87451ms

@danielnelson changed the title from "Telegraf performance use statsd input plugin & kafka output plugin" to "Kafka output performance is worse than expected" on May 23, 2017
@danielnelson reopened this on May 23, 2017
@danielnelson added the area/kafka and performance labels on Jun 8, 2017
@dennisjbr commented:

We are seeing the same issue with Telegraf 1.3.1. We did testing with a simple PHP script and metrics are posted immediately, but with the Telegraf plugin the performance is abysmal.
Taking 6 minutes to post?? Debug output:

D! Output [kafka] wrote batch of 1000 metrics in 6m40.101144501s

@danielnelson (Contributor) commented:

Wow, 6m is really bad. Do you think you could capture a trace?

If you run telegraf with --pprof-addr :6060 you can get a trace with curl 127.0.0.1:6060/debug/pprof/trace?seconds=60 > trace.out. Run this while it is outputting metrics and upload the trace.out file. You can also look at the trace file yourself with go tool trace -http=':7070' trace.out

@dennisjbr commented:

Hey Daniel, here is the file. I did it for 5 minutes though, I hope that is OK.
trace.out.zip

Also, here is some debug output during the trace. As you can see, it finally wrote to kafka at the 2-minute mark:

2017/06/15 19:18:26 I! Using config file: /etc/telegraf/telegraf.conf
2017/06/15 19:18:26 I! Starting pprof HTTP server at: http://localhost:6060/debug/pprof
2017-06-16T02:18:26Z D! Attempting connection to output: influxdb
2017-06-16T02:18:26Z D! Successfully connected to output: influxdb
2017-06-16T02:18:26Z D! Attempting connection to output: kafka
2017-06-16T02:18:26Z D! Successfully connected to output: kafka
2017-06-16T02:18:26Z I! Starting Telegraf (version 1.3.1)
2017-06-16T02:18:26Z I! Loaded outputs: influxdb kafka
2017-06-16T02:18:26Z I! Loaded inputs: inputs.kernel inputs.mem inputs.processes inputs.system inputs.influxdb inputs.net inputs.cpu inputs.diskio inputs.internal inputs.kapacitor inputs.ntpq inputs.disk inputs.swap
2017-06-16T02:18:26Z I! Tags enabled: host=data-1 hv=test-hv-value
2017-06-16T02:18:26Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"data-1", Flush Interval:10s
2017-06-16T02:18:41Z D! Output [influxdb] buffer fullness: 255 / 10000 metrics.
2017-06-16T02:18:41Z D! Output [kafka] buffer fullness: 255 / 10000 metrics.
2017-06-16T02:18:41Z D! Output [influxdb] wrote batch of 255 metrics in 13.89314ms
2017-06-16T02:18:50Z W! Skipping a scheduled flush because there is already a flush ongoing.
2017-06-16T02:19:00Z W! Skipping a scheduled flush because there is already a flush ongoing.
2017-06-16T02:19:10Z W! Skipping a scheduled flush because there is already a flush ongoing.
2017-06-16T02:19:20Z W! Skipping a scheduled flush because there is already a flush ongoing.
2017-06-16T02:19:30Z W! Skipping a scheduled flush because there is already a flush ongoing.
2017-06-16T02:19:40Z W! Skipping a scheduled flush because there is already a flush ongoing.
2017-06-16T02:19:50Z W! Skipping a scheduled flush because there is already a flush ongoing.
2017-06-16T02:20:00Z D! Output [influxdb] wrote batch of 1000 metrics in 22.074081ms
2017-06-16T02:20:00Z W! Skipping a scheduled flush because there is already a flush ongoing.
2017-06-16T02:20:30Z E! Error in plugin [inputs.diskio]: took longer to collect than collection interval (10s)
2017-06-16T02:20:30Z E! Error in plugin [inputs.processes]: took longer to collect than collection interval (10s)
2017-06-16T02:20:30Z E! Error in plugin [inputs.influxdb]: took longer to collect than collection interval (10s)
2017-06-16T02:20:30Z E! Error in plugin [inputs.system]: took longer to collect than collection interval (10s)
2017-06-16T02:20:30Z E! Error in plugin [inputs.kapacitor]: took longer to collect than collection interval (10s)
2017-06-16T02:20:30Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2017-06-16T02:20:30Z E! Error in plugin [inputs.ntpq]: took longer to collect than collection interval (10s)
2017-06-16T02:20:30Z E! Error in plugin [inputs.disk]: took longer to collect than collection interval (10s)
2017-06-16T02:20:30Z E! Error in plugin [inputs.internal]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.diskio]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.swap]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.processes]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.influxdb]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.kernel]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.net]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.system]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.kapacitor]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.ntpq]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.disk]: took longer to collect than collection interval (10s)
2017-06-16T02:20:40Z E! Error in plugin [inputs.internal]: took longer to collect than collection interval (10s)
2017-06-16T02:20:41Z D! Output [kafka] wrote batch of 255 metrics in 2m0.190538615s

@dennisjbr commented:

Sorry, I should have tagged you, @danielnelson.

@danielnelson (Contributor) commented:

Looking through the trace, all the time is being spent in the sarama library. We did update this library recently though; do you think you could try with the latest nightly?

@dennisjbr commented:

@danielnelson Sure, I can test it out sometime this weekend. Let me know if I should reach out via email. Mention my name to Colm and he can get you the deets.

@danielnelson (Contributor) commented:

@dennisjbr Can you add your kafka output's configuration?

@dennisjbr commented Jun 18, 2017

@danielnelson FYI I replaced the IP address with y.y.y.y, replaced the topic with "topic", and removed commented-out lines for brevity.

[[outputs.kafka]]
  brokers = ["y.y.y.y:9092"]
  topic = "topic"
  max_retry = 3
  data_format = "influx"

@dennisjbr commented:

@danielnelson No workie. It still hangs on Kafka, which hangs up the whole agent and stops data from being sent to all outputs. That third-party library is looking questionable ;-) Thanks in advance.

@danielnelson (Contributor) commented:

Performance should be much improved in 1.8 with the changes made in #4491. If possible, can you help test by running the nightly builds?

@hxinhan commented Sep 10, 2018

@danielnelson I have the same problem running release 1.7.1, while the performance is indeed much better in 1.8 with the changes from #4491! When will the 1.8 release be rolled out?

@glinton (Contributor) commented Sep 10, 2018

@HansonHH 1.8 should be released by the end of the month with an RC coming before that.

@hxinhan commented Sep 10, 2018

@glinton Thanks for the info! Looking forward to 1.8 with the latest and greatest!
