KPI

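error rate - percentage of requests that result in errors (errors in logs, or 4xx/5xx responses reported by CloudWatch)
availability yield - percentage of well-formed requests that succeeded
availability harvest - data returned in the response / total data that a complete response would contain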
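
As a rough illustration, the three KPIs above can be computed from simple counters. This is a minimal sketch: the function names and the choice of return values for empty inputs are assumptions, not part of the original notes.

```python
def error_rate(error_count: int, total_requests: int) -> float:
    """Percentage of requests that ended in an error."""
    if total_requests == 0:
        return 0.0  # assumption: no traffic means no errors
    return 100.0 * error_count / total_requests


def availability_yield(succeeded: int, well_formed: int) -> float:
    """Percentage of well-formed requests that succeeded."""
    if well_formed == 0:
        return 100.0  # assumption: nothing was asked, so nothing failed
    return 100.0 * succeeded / well_formed


def availability_harvest(data_returned: int, total_data: int) -> float:
    """Fraction of the complete data set that the response contained."""
    if total_data == 0:
        return 1.0  # assumption: an empty data set is trivially complete
    return data_returned / total_data
```

For example, 5 errors in 1000 requests is an error rate of 0.5 percent, and a response built from 8 of 10 shards has a harvest of 0.8.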

Four Golden Signals

Latency
Traffic
Errors
Saturation
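
One possible way to derive the four signals from a window of request samples is sketched below. The `Request` record, the use of p99 for latency, counting only 5xx as errors, and CPU as the saturation resource are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    status: int        # HTTP status code of the response


def golden_signals(requests, window_seconds, cpu_used, cpu_capacity):
    """Summarize one observation window as the four golden signals."""
    latencies = sorted(r.latency_ms for r in requests)
    if latencies:
        # Nearest-rank 99th percentile.
        rank = max(1, math.ceil(0.99 * len(latencies)))
        p99 = latencies[rank - 1]
    else:
        p99 = 0.0
    # Assumption: only server-side (5xx) responses count as errors here.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": p99,
        "traffic_rps": len(requests) / window_seconds,
        "error_ratio": errors / len(requests) if requests else 0.0,
        "saturation": cpu_used / cpu_capacity,
    }
```

Saturation is taken as used capacity over total capacity of a single resource; in practice the most constrained resource (CPU, memory, connections, queue depth) would be the one to track.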

High Level Requirements

monitoring of all resources in the path of a request, for the case where "there is a problem and it doesn't look like code":

AWS managed resource monitoring
host monitoring at fleet level
host monitoring at host level

alerting on errors in logs, or on metrics derived from logs
tracing of the request path through services in a visual format, with traces correlated to log messages
dashboards
endpoint monitoring of the Starz app, and possibly also of third parties
robust API for third-party integrations and for exporting data for retention
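
The requirement to alert on errors in logs, or on metrics derived from logs, could look roughly like the sketch below. The log format, the `ERROR`/`FATAL` level names, and the threshold rule are hypothetical; a real setup would typically rely on the monitoring platform's own log metric filters.

```python
import re

# Assumption: error lines carry an upper-case ERROR or FATAL level token.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL)\b")


def count_errors(log_lines):
    """Turn raw log lines into a metric: the number of error lines."""
    return sum(1 for line in log_lines if ERROR_PATTERN.search(line))


def should_alert(log_lines, threshold):
    """Fire an alert when the error count reaches the threshold."""
    return count_errors(log_lines) >= threshold
```

Called once per time window with that window's log lines, `should_alert(lines, 2)` would fire as soon as two or more error lines appear.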