
[inputs/modbus] Implement retry when slave is busy #7271

Merged: 8 commits into influxdata:master, Apr 21, 2020

Conversation

@srebhan (Member) commented Apr 2, 2020

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

Currently, when the slave device is busy, e.g. because it is already serving a request from another client, we get an error and the measurement is lost for that gathering interval. This potentially leads to many lost measurements for very busy devices.

With this PR you can specify the number of retry attempts (and a wait time between them) to be performed if the device is busy. This achieves a higher probability of getting a measurement on busy devices for each query interval.
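For illustration, here is a minimal sketch of the retry-on-busy idea described above. It is not the actual diff of this PR: gatherWithRetries and isBusyError are made-up names, and the busy check assumes the ModbusError type of the goburrow/modbus library used by the plugin.

```go
package main

import (
	"errors"
	"time"

	mb "github.com/goburrow/modbus"
)

// isBusyError reports whether err is a Modbus "server device busy"
// exception, the only condition this sketch retries on.
func isBusyError(err error) bool {
	var mbErr *mb.ModbusError
	return errors.As(err, &mbErr) &&
		mbErr.ExceptionCode == mb.ExceptionCodeServerDeviceBusy
}

// gatherWithRetries calls read and, on a busy error, waits and tries
// again up to `retries` additional times. Success or any other error
// is returned immediately.
func gatherWithRetries(read func() error, retries int, wait time.Duration) error {
	for attempt := 0; ; attempt++ {
		err := read()
		if err == nil || !isBusyError(err) || attempt >= retries {
			return err
		}
		time.Sleep(wait)
	}
}

func main() {
	// Stand-in for an actual register read against the device.
	read := func() error { return nil }
	_ = gatherWithRetries(read, 3, 100*time.Millisecond)
}
```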

@ssoroka (Contributor) left a comment

Looks pretty good. Will need to decide what to do if the RetriesWaitTime * Retries is too long. See comments.

(3 review comment threads on plugins/inputs/modbus/modbus.go, now outdated and resolved)
@srebhan requested a review from reimda, April 17, 2020 14:46
@danielnelson (Contributor) commented:

We have similar logic in the SNMP plugin, and it has been somewhat problematic. The big issues we have there are:

  • Total request time can be very hard to predict, since it's not clear how many requests will be made, so it is hard to pick correct retry settings.
  • May cause intervals to be skipped, which can be worse than a single failed collection, especially when polling multiple devices (less of a concern here since this plugin connects to a single controller).
  • Can make Telegraf unresponsive to shutdown signals, since Telegraf waits for the plugin to complete Gather.
  • Results often get spread out over the interval more than expected; not necessarily an issue, since the timestamp is set correctly, but usually not desired, and it can make aggregation difficult.
  • Causes increased load on the device.

Some of these issues could be addressed by having a total Gather timeout.

We should consider whether it would be enough to ignore ExceptionCodeServerDeviceBusy errors, continuing to request the remaining data instead of the current stop-on-error behavior, and leaving the "retry" to be handled by Telegraf at the next interval.
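For comparison, a rough sketch of that alternative: skipping past busy errors instead of aborting the whole Gather. The gatherAll helper and the per-register-group closures are hypothetical, and isBusyError is the same busy check as in the sketch above.

```go
// gatherAll reads each register group and, when a group hits a
// "server device busy" exception, skips it for this interval instead
// of aborting; the missing values are simply collected again at the
// next interval. Any other error is still reported.
func gatherAll(groups []func() error) error {
	var firstErr error
	for _, read := range groups {
		err := read()
		if err == nil {
			continue
		}
		if isBusyError(err) {
			continue // busy: drop this group for now, no retry
		}
		if firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}
```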

@srebhan (Member, Author) commented Apr 18, 2020

@danielnelson, while I do understand your concerns, things here are a bit more complicated. First of all, the "stop behavior" only applies to one device and one plugin process, so effects on other devices are not to be expected. Furthermore, we use the gathered information to make control decisions, and in that case a missing datum is one of the worst things that can happen. So while a jitter of 1 or 2 seconds is perfectly fine, a whole missing interval is bad, and we cannot reduce the interval too much, as otherwise the competing process would no longer get access to the device.

So while I agree that the implementation is sub-optimal, I think there are two possible solutions:

  1. The plugin gets to know the interval and can take care to finish before the interval passes. While this is also not optimal, as you need to trust the plugin to do the right thing, you at least allow the plugin to do the right thing. ;-)
  2. The perfect solution is to allow Gather() to return the wish to retry and handle it in the agent (see the sketch after this list). This way you can fix both the shutdown and the missed-interval problems...
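Purely as an illustration of option 2, a sketch of what a retry signal from Gather() and its agent-side handling could look like. None of this exists in Telegraf today: ErrRetryLater, runGather, and the context-based shutdown handling are hypothetical. Handling the wait in the agent is what would let a shutdown interrupt the retry loop.

```go
import (
	"context"
	"errors"
	"time"
)

// ErrRetryLater is a hypothetical sentinel a plugin could return from
// Gather() to ask the agent to call it again before the next interval.
var ErrRetryLater = errors.New("gather incomplete, retry later")

// runGather is a hypothetical agent-side loop: it re-runs gather after
// a short wait as long as the plugin asks for a retry, while a
// cancelled context (e.g. on shutdown) stops the retries immediately.
func runGather(ctx context.Context, gather func() error, wait time.Duration) error {
	for {
		err := gather()
		if !errors.Is(err, ErrRetryLater) {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(wait):
		}
	}
}
```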

@danielnelson (Contributor) left a comment

Okay, we can proceed with this solution. It seems to me that it comes down to how frequently you expect these errors to occur. If you get busy errors every interval, you need retries in the plugin like this. If it only happens once an hour, it probably wouldn't be the end of the world to miss an interval.

> Furthermore, we use the gathered information to make control decisions, and in that case a missing datum is one of the worst things that can happen.

I do suggest, then, that you don't make control decisions when not all of the data is present, just to be extra safe.

Comment on lines +25 to +27
## NOTE: Please make sure that the overall retry time (#retries * wait time)
## is always smaller than the query interval, as otherwise you will get
## a "did not complete within its interval" warning.
Contributor:

Let's remove this note; I think this isn't the right advice. It may actually be desired to configure the plugin to exceed the interval. This way you are more likely to get a response, and not get into a situation where the device never fully replies in time.

Member Author:

Hmm, this note was requested earlier by @ssoroka:

> Agh, we might not be exposing that setting to the plugin. May not be possible... OK, maybe just a note in the settings not to pick a value that would end up retrying longer than your interval.

Could you guys please resolve this conflict and tell me whether I should leave that comment in or not... ;-)

Contributor:

We can resolve this post merge.

@srebhan (Member, Author) commented Apr 21, 2020

> I do suggest, then, that you don't make control decisions when not all of the data is present, just to be extra safe.

That's exactly what we do, and that's why we need the retries. If another process (that might not be under our control) happens to be reading at the same time, we will never get any data without retrying.

@danielnelson added this to the 1.15.0 milestone, Apr 21, 2020
@danielnelson added the area/modbus and feat labels, Apr 21, 2020
@danielnelson merged commit 1006c65 into influxdata:master, Apr 21, 2020
@srebhan deleted the modbus_retry branch, April 21, 2020 18:47
idohalevi pushed a commit to idohalevi/telegraf that referenced this pull request Sep 29, 2020
arstercz pushed a commit to arstercz/telegraf that referenced this pull request Mar 5, 2023