Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mqtt_consumer stops gathering metrics #921

Closed
jvermillard opened this issue Mar 23, 2016 · 43 comments · Fixed by #941
Closed

mqtt_consumer stops gathering metrics #921

jvermillard opened this issue Mar 23, 2016 · 43 comments · Fixed by #941
Labels
bug unexpected problem or unintended behavior

Comments

@jvermillard
Copy link

Using the master (id: f543dbb )

Debug output:

> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/received value=4485i 1458747929312148889
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/sent value=89017i 1458747929312192880
2016/03/23 15:45:37 Wrote 2 metrics to output cloudwatch in 18.706498ms
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/received value=4487i 1458747940315048731
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/sent value=89084i 1458747940315094247
2016/03/23 15:45:49 Wrote 2 metrics to output cloudwatch in 21.573442ms
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/received value=4489i 1458747951317141551
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/sent value=89151i 1458747951317185583
2016/03/23 15:46:01 Wrote 2 metrics to output cloudwatch in 103.83351ms
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/received value=4493i 1458747962299422282
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/sent value=89220i 1458747962299461268
2016/03/23 15:46:07 Gathered metrics, (1m0s interval), from 1 inputs in 7.260443959s
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/received value=4495i 1458747973301171090
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/sent value=89287i 1458747973301217444
2016/03/23 15:46:13 Wrote 4 metrics to output cloudwatch in 35.67736ms
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/received value=4497i 1458747984303795901
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/sent value=89354i 1458747984303867446
2016/03/23 15:46:26 Wrote 2 metrics to output cloudwatch in 29.451676ms
2016/03/23 15:47:06 Gathered metrics, (1m0s interval), from 1 inputs in 6.169308477s
2016/03/23 15:48:04 Gathered metrics, (1m0s interval), from 1 inputs in 4.868031383s
2016/03/23 15:49:08 Gathered metrics, (1m0s interval), from 1 inputs in 8.471856181s
2016/03/23 15:50:09 Gathered metrics, (1m0s interval), from 1 inputs in 9.927458651s
2016/03/23 15:51:05 Gathered metrics, (1m0s interval), from 1 inputs in 5.143343762s
2016/03/23 15:52:02 Gathered metrics, (1m0s interval), from 1 inputs in 2.918680205s
2016/03/23 15:53:07 Gathered metrics, (1m0s interval), from 1 inputs in 7.1512482s
2016/03/23 15:54:00 Gathered metrics, (1m0s interval), from 1 inputs in 863.127737ms
2016/03/23 15:55:03 Gathered metrics, (1m0s interval), from 1 inputs in 3.247517878s
2016/03/23 15:56:03 Gathered metrics, (1m0s interval), from 1 inputs in 3.212414598s
2016/03/23 15:57:08 Gathered metrics, (1m0s interval), from 1 inputs in 8.836702264s
2016/03/23 15:58:06 Gathered metrics, (1m0s interval), from 1 inputs in 6.393403739s
2016/03/23 15:59:05 Gathered metrics, (1m0s interval), from 1 inputs in 5.508096694s
2016/03/23 16:00:04 Gathered metrics, (1m0s interval), from 1 inputs in 4.12486766s
2016/03/23 16:01:00 Gathered metrics, (1m0s interval), from 1 inputs in 408.861612ms

After some times nothing is pushed to cloudwatch

@sparrc
Copy link
Contributor

sparrc commented Mar 23, 2016

From this output it appears that nothing is being gathered from your mqtt_consumer input, so there is nothing to write to cloudwatch

@sparrc sparrc closed this as completed Mar 23, 2016
@sparrc
Copy link
Contributor

sparrc commented Mar 23, 2016

you can re-open as a different issue if the problem is with the mqtt_consumer not gathering metrics? Do you expect it to be gathering metrics?

@sparrc
Copy link
Contributor

sparrc commented Mar 23, 2016

if you think it's an issue with the mqtt_consumer, can you provide the commit of the paho mqtt client that you're using? If it doesn't match the commit in Godeps I would suggest rebuilding and using gdm.

@jvermillard
Copy link
Author

the commit is matching
commit 617c801af238c3af2d9e72c5d4a0f02edad03ce5
Author: Al S-M [email protected]
Date: Mon Feb 8 11:21:59 2016 +0000

@sparrc sparrc changed the title After some times the cloudwatch output plugin stop mqtt_consumer stops gathering metrics Mar 23, 2016
@sparrc
Copy link
Contributor

sparrc commented Mar 23, 2016

okay, can you provide more details? How long does it take for this to happen? do you have any relevant mosquitto logs? the mqtt_consumer section of the config file would help as well.

@sparrc sparrc reopened this Mar 23, 2016
@jvermillard
Copy link
Author

let me see if I can do a tcpdump now I know it's a mqtt problem.
also I'm building with go1.5 linux/amd64

@jvermillard
Copy link
Author

if I stop restart the telgraf process it's working again

@jvermillard
Copy link
Author

tshark show me no publish coming from mosquitto, so maybe the problem is mosquitto or the auto reconnect feature of the paho mqtt client which doesn't resubscribe when reconnecting on non persitent sessions.
I'll try to use persistent sessions and try again.

sparrc added a commit that referenced this issue Mar 23, 2016
This has not been consistently reliable. This is likely a bug with the
paho MQTT client library. It was supposed to be fixed, but will need to
be revisited in the future.

fixes #921
@sparrc sparrc added the bug unexpected problem or unintended behavior label Mar 23, 2016
@jvermillard
Copy link
Author

So with persistent session, no problem during the last 15 hours.

@sparrc
Copy link
Contributor

sparrc commented Mar 24, 2016

@jvermillard could you provide your OS and config file? I'd like to try to reproduce so I can file a bug on the paho mqtt client project

@sparrc
Copy link
Contributor

sparrc commented Mar 24, 2016

I'd also like to reproduce so I can test if this library does better: https://github.com/yosssi/gmq

it certainly has better go style and unit test coverage

@jvermillard
Copy link
Author

I'm using 1.4.8-0mosquitto1 from debian jessie in a docker container (docker 1.8.3) running on CoreOS 835.13.0

@jvermillard
Copy link
Author

Also I use the default mosquitto config

@sparrc
Copy link
Contributor

sparrc commented Mar 29, 2016

could you provide the telegraf config? I haven't been able to reproduce so far

@jvermillard
Copy link
Author

# Configuration for telegraf agent,
[agent]
  ## Default data collection interval for all inputs
  interval = "60s"
  ## Rounds collection interval to 'interval',
  ## ie, if interval=\"10s\" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will cache metric_buffer_limit metrics for each output, and will
  ## flush this buffer on a successful write.
  metric_buffer_limit = 1000
  ## Flush the buffer whenever full, regardless of flush_interval.
  flush_buffer_when_full = true

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "10s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "10s"

  ## Run telegraf in debug mode
  debug = false
  ## Run telegraf in quiet mode
  quiet = true
  ## Override default hostname, if empty use os.Hostname()
  hostname = \"",{"Ref": "EnvName" },"-mosquitto",

# Read metrics from MQTT topic(s)
[[inputs.mqtt_consumer]]
  servers = ["mosquitto:1883"]
  ## MQTT QoS, must be 0, 1, or 2
  qos = 0

  persistent_session = true
  client_id = "telegraf"

  ## Topics to subscribe to
  topics = [
    "$SYS/broker/clients/total",
    "$SYS/broker/messages/stored",
    "$SYS/broker/subscriptions/count",
    "$SYS/broker/bytes/sent",
    "$SYS/broker/bytes/received",
    "$SYS/broker/load/sockets/1min",
    "$SYS/broker/load/bytes/received/1min",
    "$SYS/broker/load/bytes/sent/1min",
    "$SYS/broker/load/connections/1min",
    "$SYS/broker/messages/inflight",
    "$SYS/broker/heap/current",
    "$SYS/broker/load/publish/received/1min",
  ]
  data_format = \"value\"
  data_type = \"float\"
,
# Configuration for AWS CloudWatch output.
[[outputs.cloudwatch]]
  ## Amazon REGION\n",
  region = '",{ "Ref" : "AWS::Region" },"

  ## Namespace for the CloudWatch MetricDatums
  namespace = mosquitto

@sparrc
Copy link
Contributor

sparrc commented Mar 30, 2016

but you don't see the problem with persistent_session = true, correct? Or now you are?

@jvermillard
Copy link
Author

oups sorry... yes with persistent_session = true no problems with persistent_session = false I get some disconnection from time to time (from 20 minutes to 1hour and a half)

@jvermillard
Copy link
Author

so if you want to test the bug use this config without persistent_session

@sparrc
Copy link
Contributor

sparrc commented Mar 30, 2016

Oh, got it, I was able to reproduce within 3 gathers this time. The reason is that your collection interval is larger than the default MQTT KeepAlive timer (60s).

But this is not supposed to be a problem at all, the client library doesn't appear to be sending the proper keepalive heartbeat messages and is then disconnecting. I'm not sure the best solution, we could just set the keepalive to a large number that users would likely never hit (2 hours?).

@sparrc
Copy link
Contributor

sparrc commented Mar 30, 2016

I've also just realized that they've moved the mqtt library off gerrit onto https://github.com/eclipse/paho.mqtt.golang, and there are some new changes and fixes there. I'll test with that updated library and see if it fixes the problem.

@sparrc
Copy link
Contributor

sparrc commented Mar 30, 2016

@jvermillard in my testing I was getting the disconnect errors within 2-10 minutes using master. After updating to the updated github paho client I wasn't able to reproduce at all, and had one instance running for more than 6 hours without disconnecting. I've pushed this change into master (see #941). Please feel free to re-open if you still see the issue.

@chaton78
Copy link
Contributor

chaton78 commented Apr 2, 2016

I pulled the repo today and I got the issue again... paho was pulled from Restoring /root/work/src/github.com/eclipse/paho.mqtt.golang to 4ab3e867810d1ec5f35157c59e965054dbf43a0d

@sparrc
Copy link
Contributor

sparrc commented Apr 2, 2016

@chaton78 can you provide any relevant mqtt or telegraf logs? and your telegraf config?

@chaton78
Copy link
Contributor

chaton78 commented Apr 7, 2016

Sorry for the delay.. I've redone my test with 0.12-1 still have the same issue. Please note that I have a bridged mosquitto broker connecting to the same rabbitmq server. No errors at the same time.

This is what I have in journalctl (Centos 7)

Apr 07 08:13:10 dorval telegraf[14623]: 2016/04/07 08:13:10 Wrote 23 metrics to output influxdb in 139.81154ms
Apr 07 08:13:20 dorval telegraf[14623]: 2016/04/07 08:13:20 Gathered metrics, (10s interval), from 1 inputs in 27µs
Apr 07 08:13:30 dorval telegraf[14623]: 2016/04/07 08:13:30 Gathered metrics, (10s interval), from 1 inputs in 25.375µs

Last write was at 8:13:10... nothing after...

In rabbitmq, at the same time and after:

=INFO REPORT==== 7-Apr-2016::08:13:06 ===
MQTT detected network error for "162.xxx.yyy.33:41414 -> 162.xxx.yyy.33:8883": peer closed TCP connection

=INFO REPORT==== 7-Apr-2016::08:13:06 ===
accepting MQTT connection <0.23177.5> (162.xxx.yyy.33:37658 -> 162.xxx.yyy.33:8883)

=INFO REPORT==== 7-Apr-2016::08:13:15 ===
MQTT detected network error for "162.xxx.yyy.33:37658 -> 162.xxx.yyy.33:8883": peer closed TCP connection

=INFO REPORT==== 7-Apr-2016::08:13:16 ===
accepting MQTT connection <0.23228.5> (162.xxx.yyy.33:37679 -> 162.xxx.yyy.33:8883)

=INFO REPORT==== 7-Apr-2016::13:25:46 ===
MQTT detected network error for "162.xxx.yyy.33:37679 -> 162.xxx.yyy.33:8883": peer closed TCP connection

=INFO REPORT==== 7-Apr-2016::13:25:50 ===
accepting MQTT connection <0.14489.7> (162.xxx.yyy.33:41572 -> 162.xxx.yyy.33:8883)

@chaton78
Copy link
Contributor

chaton78 commented Apr 7, 2016

My config for telegraf:

username = "hahaha"
password = "hihihi"
qos = 1

persistent_session = false
client_id = "telegraf-dorval"

servers = ["dorval.xxx.yyy:8883"]
topics = ["/feeds/sensors"]
ssl_ca = "/etc/keys/ca.crt"
ssl_cert = "/etc/keys/server.crt"
ssl_key = "/etc/keys/server.key"

@sparrc
Copy link
Contributor

sparrc commented Apr 7, 2016

@chaton78 can you provide more details? What is your telegraf agent config? How long does it take for this problem to occur?

@sparrc sparrc reopened this Apr 7, 2016
@vlasov01
Copy link

vlasov01 commented Apr 7, 2016

@sparrc I think I have the same issue. I need to restart Telegraf on Fedora22/ARM every couple days. I suspect it is related to network connection drops, when it results in MQTT client connection drop to the MQTT broker as well, and then Telegraf can't recover it. As far as I understand MQTT specs, it is a client responsibility to recover connection to the broker.

@chaton78
Copy link
Contributor

chaton78 commented Apr 7, 2016

I can take few hours to a full day....
Agent config

[agent]
flush_interval = "10s"
flush_jitter = "5s"
interval = "10s"
round_interval = true
debug=false

Running on Linux 3.10.0-327.13.1.el7.x86_64
RabbitMQ 3.5.6 on Erlang 18.3

I am using the public ip of my interface. Let me switch to the loopback. It is my ideal config but we will see....

@sparrc
Copy link
Contributor

sparrc commented Apr 7, 2016

thanks all for the details, https://github.com/eclipse/paho.mqtt.golang needs to be patched. As far as I can tell, the client reconnect isn't working properly. I'll need to figure out a way to reproduce more reliably and then can push up a fix for that.

@sparrc
Copy link
Contributor

sparrc commented Apr 7, 2016

It actually looks like this might be fixed as of a few days ago: eclipse/paho.mqtt.golang#32,

I'll update the telegraf Godeps file to grab this fix. If anyone is able to build from source and test that would be much appreciated :)

@sparrc sparrc closed this as completed in 1bf904f Apr 7, 2016
@sparrc
Copy link
Contributor

sparrc commented Apr 7, 2016

didn't mean to close this yet

@sparrc sparrc reopened this Apr 7, 2016
@chaton78
Copy link
Contributor

chaton78 commented Apr 8, 2016

I built it this morning. It is running.

@chaton78
Copy link
Contributor

So far, so good...

@chaton78
Copy link
Contributor

It happened again.. but from what I can see in rabbitmq log, client reconnected just after. Is it possible that the re-connection works but the subscription is lost?

@chaton78
Copy link
Contributor

I did a fix last night. Testing it.... Any idea why make test_short doesn't work on go 1.6... Something about calling internal packages. I'm not very familiar with go.

@sparrc
Copy link
Contributor

sparrc commented Apr 12, 2016

@chaton78 do you have your telegraf directory in github.com/influxdb/telegraf instead of github.com/influxdata/telegraf?

@chaton78
Copy link
Contributor

@sparrc I am running from my fork github/chaton78/telegraf... I have also influxdata/telegraf

@sparrc
Copy link
Contributor

sparrc commented Apr 12, 2016

building in your fork directly won't quite work correctly on Go 1.6. This is because the import paths still point to influxdata/telegraf, which your chaton78/telegraf package sees as an external package that is not allowed to have internal imports.

the best way to solve this is to rm -rf chaton78/telegraf and build from influxdata/telegraf.

You can then go into the influxdata/telegraf directory and change your git origin remote (or setup a different remote, such as fork) to chaton78/telegraf. That way you can push to your fork but you will retain the correct import paths.

@chaton78
Copy link
Contributor

Understood. Good idea.. Thank you!

@sparrc sparrc closed this as completed in 93f57ed Apr 18, 2016
@Dees7
Copy link

Dees7 commented Aug 15, 2018

Hello.
I have the same trouble. Sometimes telegraf skip and then stops gathering and pushing mqtt messages.
11111
I use Telegraf unknown (git: master 3268937) with config:

[global_tags]
[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 500
  metric_buffer_limit = 1000
  collection_jitter = "0s"
  flush_interval = "60s"
  flush_jitter = "0s"
  precision = ""
  debug = true
  quiet = false
  hostname = ""
  omit_hostname = false
[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]
  database = "arduino"
  skip_database_creation = false
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"
  username = "telegraf"
  password = "pass"
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
[[inputs.disk]]
  mount_points = ["/","/var/wal"]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.net]]
  interfaces = ["wlan0","eth0"]
[[inputs.procstat]]
  user = "pi|telegraf|grafana|influxdb|mosquitto"
[[inputs.mqtt_consumer]]
  servers = ["tcp://localhost:1883"]
  qos = 0
  connection_timeout = "120s"
  topics = ["tele/#"]
  persistent_session = true
  client_id = "telegraf"
  username = "user"
  password = "pass"
  name_override = "mqttc"
  data_format = "json"

Sonoff with Tasmota firmware push mqtt metrics every 60sec and I can view them with

mosquitto_sub -u 'user' -P 'pass'  -v -t '#'
tele/sonoff/STATE {"Time":"2018-08-15T20:35:29","Uptime":"0T00:11:14","Vcc":2.821,"POWER1":"0","POWER2":"1","Wifi":{"AP":1,"SSId":"195","RSSI":52,"APMac":"54:04:A6:CA:AF:10"}}
tele/sonoff/SENSOR {"Time":"2018-08-15T20:35:29","DS18B20":{"Temperature":41.6},"BMP280":{"Temperature":25.7,"Pressure":982.1},"ENERGY":{"Total":57.965,"ActivePower":54,"ApparentPower":60,"ReactivePower":-27,"Frequency":50.05,"Factor":0.88,"Voltage":234,"Current":0.240,"Phase angle":-28.32,"Import Active Power":58,"Export Active Power":0,"Import Reactive Power":0,"Export Reactive Power":51,"Total Reactive Power":51},"TempUnit":"C"}
tele/sonoff/STATE {"Time":"2018-08-15T20:36:35","Uptime":"0T00:12:20","Vcc":2.820,"POWER1":"0","POWER2":"1","Wifi":{"AP":1,"SSId":"195","RSSI":48,"APMac":"54:04:A6:CA:AF:10"}}
tele/sonoff/SENSOR {"Time":"2018-08-15T20:36:35","DS18B20":{"Temperature":41.2},"BMP280":{"Temperature":25.8,"Pressure":982.1},"ENERGY":{"Total":57.966,"ActivePower":54,"ApparentPower":59,"ReactivePower":-24,"Frequency":50.02,"Factor":0.90,"Voltage":234,"Current":0.248,"Phase angle":-25.85,"Import Active Power":58,"Export Active Power":0,"Import Reactive Power":0,"Export Reactive Power":51,"Total Reactive Power":51},"TempUnit":"C"}

After telegraf stops to push mqtt metrics it continues normal push local system metrics.

In this situation if I try to stop telegraf it take long time (2-3min)

Aug 15 20:27:28 piduino systemd[1]: telegraf.service: State 'stop-sigterm' timed out. Killing.
Aug 15 20:27:28 piduino systemd[1]: telegraf.service: Killing process 2711 (telegraf) with signal SIGKILL.
Aug 15 20:27:28 piduino systemd[1]: telegraf.service: Main process exited, code=killed, status=9/KILL
Aug 15 20:27:28 piduino systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
Aug 15 20:27:28 piduino systemd[1]: telegraf.service: Unit entered failed state.
Aug 15 20:27:28 piduino systemd[1]: telegraf.service: Failed with result 'timeout'.

@glinton
Copy link
Contributor

glinton commented Aug 15, 2018

@Dees7 I wonder if your issue is because your interval and flush_interval are the same. Maybe try setting your flush_interval to 80s

@Dees7
Copy link

Dees7 commented Aug 15, 2018

I tied (client publish 30, interval 30, flush 60).

How can I view debug as at the first message?

mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/received value=4485i 1458747929312148889

@glinton
Copy link
Contributor

glinton commented Aug 15, 2018

Do you mind opening a new issue for this, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior
Projects
None yet
6 participants