Nagios Hilary errors
Simon Gaeremynck edited this page May 6, 2015 · 1 revision
Summary
A Nagios check starts to complain once a certain number of errors have been generated in the Hilary cluster. An "error" is any log().error invocation.
Every time an error is logged, the logger:error.count key of the oae-telemetry:counts:data hash is incremented. Nagios checks that value periodically and complains if it goes over a certain threshold.
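The check described above can be sketched as a small Nagios-style plugin. This is an illustration only, not the check that is actually deployed: it assumes the telemetry hash lives in Redis and is reachable with redis-cli, and the function names and the thresholds (10 and 50) are hypothetical. The exit codes follow the standard Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).

```shell
#!/bin/sh
# Sketch of a Nagios-style check for the Hilary error counter.
# Function names and thresholds are hypothetical, not the deployed check.

# Map an error count to a Nagios status line and return code.
# Usage: classify_error_count COUNT WARN_THRESHOLD CRIT_THRESHOLD
classify_error_count() {
    count="$1"; warn="$2"; crit="$3"
    case "$count" in
        # Empty or non-numeric value: the counter could not be read.
        ''|*[!0-9]*)
            echo "UNKNOWN - could not read logger:error.count"
            return 3 ;;
    esac
    if [ "$count" -gt "$crit" ]; then
        echo "CRITICAL - $count errors logged"
        return 2
    elif [ "$count" -gt "$warn" ]; then
        echo "WARNING - $count errors logged"
        return 1
    fi
    echo "OK - $count errors logged"
    return 0
}

# Read the counter. Assumes the hash is stored in Redis and that
# redis-cli can reach that instance with its default settings.
fetch_error_count() {
    redis-cli HGET oae-telemetry:counts:data logger:error.count
}

# Example wiring with hypothetical thresholds:
#   classify_error_count "$(fetch_error_count)" 10 50; exit $?
```

Keeping the threshold logic separate from the Redis lookup makes it easy to try the classification by hand with different counts before pointing it at a live instance.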
Actions to take on warning/error
When the check goes into a warning or error state, check the logs on the syslog machines like so (increase the number of lines to look further back for more errors):
tail-hilary -n 400 | filter-bunyan -l error
Once you've resolved the issue, reset the count to 0. A script on cache0, /root/reset-error-count, does this for you.
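If the script is not available, a manual reset might look like the sketch below. The contents of /root/reset-error-count are not shown on this page, so this is an assumption about what such a reset involves rather than a copy of it: it presumes the hash is stored in Redis and reachable with redis-cli. The function name and the REDIS_CLI override are hypothetical.

```shell
#!/bin/sh
# Sketch of resetting the Hilary error counter by hand.
# Not the actual /root/reset-error-count script; assumes the
# telemetry hash is stored in Redis.

# Allow the client binary to be overridden (hypothetical convenience).
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# Set the logger:error.count field of the telemetry hash back to 0.
reset_error_count() {
    "$REDIS_CLI" HSET oae-telemetry:counts:data logger:error.count 0
}

# Example: run on a machine that can reach the Redis instance.
#   reset_error_count
```

Setting the field to 0 rather than deleting it keeps the key present, which matches the "setting the count back to 0" wording above.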