Ridiculously simple server health alerts
This is an extremely lightweight, stupidly simple to extend, and easy to hack server health monitor. For when Zabbix, Cacti, NetData, OpenNMS, Icigna, etc. are WAY overkill. Use it to make sure your server isn't about to explode for some reason.
Should run on most Linux servers. All dependencies are probably already installed, like curl
, wget
, grep
, awk
, cut
, tr
etc.
Just run install.sh
as root. Here's a one-liner to download that file from this repository, run it, and then delete it...
sudo wget -q https://raw.githubusercontent.com/gazugafan/healthbot/master/install.sh && sudo bash install.sh; sudo rm -f install.sh
If you'd rather install manually, here's what the install script does...
- Puts
src/healthbot.sh
at/usr/bin/healthbot
with 755 permissions - Puts everything in
src/etc/healthbot
at/etc/healthbot
with 755 permissions (files here can be run as bash scripts with full root permissions, so be sure to restrict write permissions) - Creates a cronjob file at
/etc/cron.d
that runs/usr/bin/healthbot
every hour as root
Default configuration files will be created at /etc/healthbot
. Only files ending in .enabled
are used. Everything else is disabled and will always be ignored by HealthBot. This makes it easy to toggle features on/off.
Importantly, the HealthBot configuration files are just regular bash scripts. You can put any commands you'd like in them, and they'll be run like normal if HealthBot decides to load the file. In fact, each configuration file actually contains the code to perform its own task. This keeps things organized and makes it super easy to add your own health checks and alerts.
The main configuration file is /etc/healthbot/general.enabled
, but really the only thing in there is a little documentation and a single option for alert throttling. You might want to throttle alerts after you're done testing to make sure you don't end up spamming yourself whenever a health problem occurs.
In /etc/healthbot/checks
you'll find all of the health checks that HealthBot will perform every time it runs. Again, only ones ending with .enabled
will be included. Anything else is disabled. The available health checks out-of-the-box are...
- storage: check for low disk space
- inodes: check for low inode availability
- swap: check for low swap disk space
- memory: check for high system memory usage
- load: check for high average load
- processes: make sure important processes are running
- web: make sure a web request returns as expected
- ping: make sure certain servers or services are responding
- replication: check that a MySQL replicant is running well
- backup: check that a recent local backup file exists
- supervisor: check that all supervisord processes are running
- docker: check that important docker containers are running
Just change the extensions on these configuration files to .enabled
or .disabled
to turn on the ones you want. Edit each file to tweak their configuration options. If you look below the configuration options in each file, you'll see the actual code that runs to perform the check. They're all REALLY simple. Feel free to hack at them. Or better yet, copy and paste one to start creating your own!
In /etc/healthbot/alerts
you'll find all of the alerts that HealthBot will send whenever the health checks find any problems. Again, only ones ending with .enabled
are enabled. Anything else is disabled. There are 2 included out-of-the-box...
- stdout: just echos out the problems to the commandline. Mostly useful for testing, or maybe for piping to something else
- ifttt: Sends the problems to an IFTTT webhook. You can use this as an IFTTT trigger and then do whatever you want with it. Like maybe send yourself an email or give yourself a call?
The main HealthBot script is run via an hourly cronjob (you might consider increasing the frequency after everything is configured). Everytime it runs, it...
- Loads the enabled general configuration file(s) at
/etc/healthbot
(probably just that onegeneral.enabled
file). - Runs the enabled health checks in
/etc/healthbot/checks
- If any problems are found, and alert throttling isn't an issue, it runs all the enabled alerts in
/etc/healthbot/alerts
Just run healthbot
to run HealthBot once and test your configuration immediately. By default, stdout alerts are enabled, which will echo out any problems back to the command-line for you to see. If you don't see anything, it's possible that all of the default health checks are passing. By default, HealthBot checks to make sure nginx, php-fpm, and mysqld are running in the processes check, and that you're not running lower than 80% capacity on storage/swap/inode space in the storage, swap, and inode checks. Tweak some of those thresholds or add some dummy processes to see health problems in action!
It's so easy! Just copy one of the checks or alerts and use it as a base to create your own. Each one contains both it's own configuration options AND the code to make it work. They're honestly really simple. If you come up with a good one, feel free to send a PR and I'll add it for everyone else to use!
If your check finds a problem, it should add it to the health_problems
array. Your alert will probably want to reference the health_problems
array to generate some sort of message about the problems that are occurring.
Just delete /usr/bin/healthbot
, /etc/healthbot
, and /etc/cron.d/healthbot
. There might also be an empty file at /tmp/healthbot_stamp
, which you can delete.
Inspired by stephenlang/system-health-check and nozel-org/serverbot. Thanks!!