This is a guide to improving system reliability through logging, monitoring, alerting, and a process of continuous improvement.
If we want to consider ourselves 'engineers', our systems need to work reliably. When they do not, we need to know when they are failing and why.
This documentation has several parts:
- Process of Making Unreliable Systems Reliable - A process for making incremental improvements to existing systems
- Slides for 'Shedding Light on Black Box Services' talk - Broad overview of why we like reliable systems and the tools we can use to get them
- Logging Fundamentals - Covers the basics of logging
- Logging Architecture - Things to think about when choosing a logging framework
- Site Reliability Engineering Book - Summary of the more valuable parts of the book
- When to Conduct a Postmortem - A critical part of the continuous improvement process
This documentation is primarily limited to application logging. OS, web service, and other types of logging are not covered.
Most of the ideas in this repo come from the following sources, where I first learned them:
- Google Site Reliability Engineering book
- Prometheus documentation
- A great ChangeLog podcast episode in which the creator of Prometheus discusses why the tool was built