feat: add system health dashboard #688
Add a CloudWatch dashboard that displays the system health metrics, logs and alarms. The goal is to make it easier for on-callers to identify where issues are occurring if they are paged.
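For readers unfamiliar with how such a dashboard is declared, here is a minimal sketch of the resource with a single widget, using values copied from the plan output in this PR. The `dashboard_name` value is an assumption (the truncated plan does not show it), and the real resource defines many more widgets.

```hcl
resource "aws_cloudwatch_dashboard" "forms_service_health" {
  # Hypothetical name: the truncated plan does not show dashboard_name.
  dashboard_name = "forms-service-health"

  # The dashboard body is a JSON document, built here with jsonencode.
  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        x      = 0
        y      = 2
        width  = 12
        height = 12
        properties = {
          # "." reuses the value in the same position of the previous
          # metric entry (here, the "forms" namespace).
          metrics = [
            ["forms", "ClientSubmitSuccess", { color = "#2ca02c" }],
            [".", "ClientSubmitFailed", { color = "#d62728" }],
          ]
          period  = 300
          region  = "ca-central-1"
          stacked = false
          stat    = "Sum"
          title   = "App: client submissions"
          view    = "timeSeries"
        }
      },
    ]
  })
}
```

Widgets are placed on a 24-column grid by their `x`, `y`, `width` and `height` values, which is why the full-width text headers in the plan use `width = 24`.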
⚠ Terraform update available
Terraform: 1.8.5 (using 1.6.6)
Terragrunt: 0.59.3 (using 0.54.8)
Staging: rds
✅ Terraform Init: Plan: 0 to add, 0 to change, 0 to destroy
Changes to Outputs:
+ rds_cluster_identifier = "forms-staging-db-cluster"
You can apply this plan to save these new output values to the Terraform
state, without changing any real infrastructure.
─────────────────────────────────────────────────────────────────────────────
Saved the plan to: plan.tfplan
To perform exactly these actions, run the following command to apply:
terraform apply "plan.tfplan"
Conftest results:
WARN - plan.json - main - Missing Common Tags: ["aws_rds_cluster.forms"]
WARN - plan.json - main - Missing Common Tags: ["aws_secretsmanager_secret.database_secret"]
WARN - plan.json - main - Missing Common Tags: ["aws_secretsmanager_secret.database_url"]
22 tests, 19 passed, 3 warnings, 0 failures, 0 exceptions
Staging: alarms
✅ Terraform Init: Plan: 1 to add, 0 to change, 0 to destroy
✂ Warning: plan has been truncated! See the full plan in the logs.
Resource actions are indicated with the following symbols:
+ create
Terraform will perform the following actions:
# aws_cloudwatch_dashboard.forms_service_health will be created
+ resource "aws_cloudwatch_dashboard" "forms_service_health" {
+ dashboard_arn = (known after apply)
+ dashboard_body = jsonencode(
{
+ widgets = [
+ {
+ height = 12
+ properties = {
+ metrics = [
+ [
+ "forms",
+ "ClientSubmitSuccess",
+ {
+ color = "#2ca02c"
},
],
+ [
+ ".",
+ "ClientSubmitFailed",
+ {
+ color = "#d62728"
},
],
]
+ period = 300
+ region = "ca-central-1"
+ stacked = false
+ stat = "Sum"
+ title = "App: client submissions"
+ view = "timeSeries"
}
+ type = "metric"
+ width = 12
+ x = 0
+ y = 2
},
+ {
+ height = 6
+ properties = {
+ metrics = [
+ [
+ "forms",
+ "SubmissionSuccess",
+ {
+ color = "#2ca02c"
},
],
+ [
+ ".",
+ "SubmissionWarn",
+ {
+ color = "#ffbb78"
},
],
+ [
+ ".",
+ "SubmissionFailed",
+ {
+ color = "#d62728"
},
],
]
+ period = 300
+ region = "ca-central-1"
+ stacked = false
+ stat = "Sum"
+ title = "Lambda: submission"
+ view = "timeSeries"
}
+ type = "metric"
+ width = 12
+ x = 12
+ y = 2
},
+ {
+ height = 8
+ properties = {
+ metrics = [
+ [
+ "AWS/RDS",
+ "CPUUtilization",
+ "DBClusterIdentifier",
+ "forms-mock-db-cluster",
+ {
+ color = "#17becf"
+ region = "ca-central-1"
},
],
]
+ period = 60
+ region = "ca-central-1"
+ sparkline = true
+ stacked = false
+ stat = "Average"
+ title = "DB: CPU use"
+ view = "timeSeries"
}
+ type = "metric"
+ width = 6
+ x = 0
+ y = 74
},
+ {
+ height = 6
+ properties = {
+ metrics = [
+ [
+ "forms",
+ "ReliabilitySuccess",
+ {
+ color = "#2ca02c"
},
],
+ [
+ ".",
+ "ReliabilityWarn",
+ {
+ color = "#ffbb78"
},
],
+ [
+ ".",
+ "ReliabilityFailed",
+ {
+ color = "#d62728"
},
],
]
+ period = 300
+ region = "ca-central-1"
+ stacked = false
+ stat = "Sum"
+ title = "Lambda: reliability"
+ view = "timeSeries"
}
+ type = "metric"
+ width = 12
+ x = 12
+ y = 8
},
+ {
+ height = 2
+ properties = {
+ background = "transparent"
+ markdown = ""
}
+ type = "text"
+ width = 24
+ x = 0
+ y = 20
},
+ {
+ height = 8
+ properties = {
+ metrics = [
+ [
+ "AWS/RDS",
+ "FreeableMemory",
+ "DBClusterIdentifier",
+ "forms-mock-db-cluster",
+ {
+ color = "#9467bd"
},
],
]
+ period = 60
+ region = "ca-central-1"
+ sparkline = true
+ stacked = false
+ stat = "Average"
+ title = "DB: freeable memory"
+ view = "timeSeries"
}
+ type = "metric"
+ width = 6
+ x = 6
+ y = 74
},
+ {
+ height = 8
+ properties = {
+ metrics = [
+ [
+ "AWS/RDS",
+ "ReadLatency",
+ "DBClusterIdentifier",
+ "forms-mock-db-cluster",
+ {
+ color = "#c5b0d5"
},
],
]
+ period = 60
+ region = "ca-central-1"
+ sparkline = true
+ stacked = false
+ stat = "Average"
+ title = "DB: read latency"
+ view = "timeSeries"
}
+ type = "metric"
+ width = 6
+ x = 12
+ y = 74
},
+ {
+ height = 8
+ properties = {
+ metrics = [
+ [
+ "AWS/RDS",
+ "WriteLatency",
+ "DBClusterIdentifier",
+ "forms-mock-db-cluster",
+ {
+ color = "#7f7f7f"
},
],
]
+ period = 60
+ region = "ca-central-1"
+ sparkline = true
+ stacked = false
+ stat = "Average"
+ title = "DB: write latency"
+ view = "timeSeries"
}
+ type = "metric"
+ width = 6
+ x = 18
+ y = 74
},
+ {
+ height = 2
+ properties = {
+ background = "transparent"
+ markdown = <<-EOT
# Form submissions
Tracking form submissions flow through the system.
EOT
}
+ type = "text"
+ width = 24
+ x = 0
+ y = 0
},
+ {
+ height = 6
+ properties = {
+ metrics = [
+ [
+ "AWS/SQS",
+ "NumberOfMessagesReceived",
+ "QueueName",
+ "submission_processing.fifo",
+ {
+ color = "#8c564b"
},
],
]
+ period = 300
+ region = "ca-central-1"
+ stacked = false
+ stat = "Sum"
+ title = "Queue: submission messages"
+ view = "timeSeries"
}
+ type = "metric"
+ width = 8
+ x = 0
+ y = 14
},
+ {
+ height = 6
+ properties = {
+ metrics = [
+ [
+ "AWS/SQS",
+ "ApproximateAgeOfOldestMessage",
+ "QueueName",
+ "submission_processing.fifo",
+ {
+ color = "#7f7f7f"
+ label = "Oldest message age"
+ region = "ca-central-1"
},
],
]
+ period = 300
+ region = "ca-central-1"
+ sparkline = true
+ stat = "Average"
+ title = "Queue: submission message age"
+ view = "singleValue"
}
+ type = "metric"
+ width = 4
+ x = 8
+ y = 14
},
+ {
+ height = 6
+ properties = {
+ metrics = [
+ [
+ "forms",
+ "ReliabilityNotifySendSuccess",
+ {
+ color = "#2ca02c"
+ region = "ca-central-1"
},
],
+ [
+ ".",
+ "ReliabilityNotifySendFailed",
+ {
+ color = "#d62728"
+ region = "ca-central-1"
},
],
+ [
+ ".",
+ "ReliabilityVaultSaveSuccess",
+ {
+ color = "#1f77b4"
+ region = "ca-central-1"
},
],
+ [
+ ".",
+ "ReliabilityVaultSaveFailed",
+ {
+ color = "#ff7f0e"
+ region = "ca-central-1"
},
],
]
+ period = 300
+ region = "ca-central-1"
+ stacked = false
+ stat = "Sum"
+ title = "Lambda: reliability send/save"
+ view = "timeSeries"
}
+ type = "metric"
+ width = 12
+ x = 12
+ y = 14
},
+ {
+ height = 2
+ properties = {
+ background = "transparent"
+ markdown = ""
}
+ type = "text"
+ width = 24
+ x = 0
+ y = 43
},
+ {
+ height = 2
+ properties = {
+ background = "transparent"
+ markdown = <<-EOT
# Errors
Error logs and alarms from the app and lambdas.
EOT
}
+ type = "text"
+ width = 24
+ x = 0
+ y = 22
},
+ {
+ height = 2
+ properties = {
+ background = "transparent"
+ markdown = <<-EOT
# Lambdas
Performance metrics for the Lambda functions.
EOT
}
+ type = "text"
+ width = 24
+ x = 0
+ y = 45
},
+ {
+ height = 7
+ properties = {
+ metrics = [
+ [
+ "AWS/ECS",
+ "CPUUtilization",
+ "ServiceName",
+ "form-viewer",
+ "ClusterName",
+ "Forms",
+ {
+ region = "ca-central-1"
+ stat = "Minimum"
},
],
+ [
+ "...",
+ {
+ region = "ca-central-1"
+ stat = "Maximum"
},
],
+ [
+ "...",
+ {
+ region = "ca-central-1"
+ stat = "Average"
},
],
]
+ period = 300
+ region = "ca-central-1"
+ stacked = false
+ title = "App: CPU use"
+ view = "timeSeries"
}
+ type = "metric"
+ width = 8
+ x = 0
+ y = 36
},
+ {
+ height = 7
+ properties = {
+ metrics = [
+ [
+ "AWS/ECS",
+ "MemoryUtilization",
+ "ServiceName",
+ "form-viewer",
+ "ClusterName",
+ "Forms",
+ {
+ stat = "Minimum"
},
],
+ [
+ "...",
+ {
+ stat = "Maximum"
},
],
+ [
+ "...",
+ {
+ stat = "Average"
},
],
]
+ period = 300
+ region = "ca-central-1"
+ stacked = false
+ title = "App: memory use"
+ view = "timeSeries"
}
+ type = "metric"
+ width = 8
+ x = 8
+ y = 36
},
+ {
+ height = 2
+ properties = {
+ background = "transparent"
+ markdown = ""
}
+ type = "text"
+ width = 24
+ x = 0
+ y = 59
},
+ {
+ height = 2
+ properties = {
+ background = "transparent"
+ markdown = <<-EOT
# Load balancer
Requests, errors and response time for the app's load balancer.
EOT
}
+ type = "text"
+ width = 24
+ x = 0
+ y = 61
},
+ {
+ height = 6
+ properties = {
+ metrics = [
+ [
+ "AWS/Lambda",
+ "Invocations",
+ "FunctionName",
+ "Submission",
+ {
+ region = "ca-central-1"
},
],
+ [
+ ".",
+ "Throttles",
+ ".",
+ ".",
+ {
+ color = "#ffbb78"
+ region = "ca-central-1"
},
],
+ [
+ ".",
+ "Errors",
+ ".",
+ ".",
+ {
+ region = "ca-central-1"
},
],
]
+ period = 300
+ region = "ca-central-1"
+ stacked = false
+ stat = "Sum"
+ title = "Lambda: submission"
+ view = "timeSeries"
}
+ type = "metric"
+ width = 18
+ x = 0
+ y = 47
},
+ {
+ height = 6
+ properties = {
+ metrics = [
+ [
+ "AWS/Lambda",
+ "Duration",
+ "FunctionName",
+ "Submission",
+ "Resource",
+ "Submission",
+ {
+ color = "#555555"
+ region = "ca-central-1"
},
],
]
+ period = 300
+ region = "ca-central-1"
+ sparkline = true
+ stacked = false
+ stat = "Average"
+ title = "Lambda: submission duration"
+ view = "singleValue"
}
+ type = "metric"
+ width = 6
+ x = 18
+ y = 47
},
+ {
+ height = 6
+ properties = {
+ metrics = [
+ [
+ "AWS/Lambda",
+ "Invocations",
+ "FunctionName",
+ "reliability",
+ "Resource",
+ "reliability",
+ {
+ region = "ca-central-1"
},
],
+ [
+ ".",
+ "Throttles",
+ ".",
+ ".",
+ ".",
+ ".",
+ {
+ color = "#ffbb78"
+ region = "ca-central-1"
},
],
+ [
+ ".",
+ "Errors",
+ ".",
+ ".",
+ ".",
+ ".",
+ {
+ region = "ca-central-1"
},
],
]
+ period = 300
+ region = "ca-central-1"
+ stacked = false
+ stat = "Sum"
+ title = "Lambda: reliability"
+ view = "timeSeries"
}
+ type = "metric"
+ width = 18
+ x = 0
+ y = 53
},
+ {
+ height = 6
+ properties = {
+ metrics = [
+ [
+ "AWS/Lambda",
+ "Duration",
+ "FunctionName",
+ "reliability",
+ "Resource",
+ "reliability",
+ {
+ color = "#555"
},
],
]
+ period = 300
+ region = "ca-central-1"
+ sparkline = true
+ stacked = false
+ stat = "Average"
+ title = "Lambda: reliabiity duration"
+ view = "singleValue"
}
+ type = "metric"
+ width = 6
+ x = 18
+ y = 53
},
+ {
+ height = 7
+ properties = {
+ metrics = [
+ [
+ "ECS/ContainerInsights",
+ "NetworkRxBytes",
+ "ClusterName",
+ "Forms",
+ {
+ color = "#1f77b4"
+ region = "ca-central-1"
},
],
]
+ period = 300
+ region = "ca-central-1"
+ stacked = false
+ stat = "Sum"
+ title = "App: network bytes"
+ view = "timeSeries"
}
+ type = "metric"
+ width = 8
+ x = 16
+ y = 36
},
+ {
+ height = 7
+ properties = {
+ metrics = [
+ [
+ "AWS/ApplicationELB",
+ "RequestCount",
+ "LoadBalancer",
+ "app/form-viewer/5e6bc2d9ab810b68",
+ {
+ color = "#2ca02c"
+ label = "Request count"
+ region = "ca-central-1"
},
],
+ [
+ ".",
+ "HTTPCode_ELB_4XX_Count",
+ ".",
+ ".",
+ {
+ color = "#ffbb78"
+ label =...
Conftest results:
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_event_rule.codedeploy_sns"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_log_group.notify_slack"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ELB_5xx_error_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.UnHealthyHostCount-TargetGroup1"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.UnHealthyHostCount-TargetGroup2"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.alb_ddos"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.audit_log_dead_letter_queue_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.cognito_login_outside_canada_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.cognito_signin_exceeded"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ddos_detected_forms_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ddos_detected_route53_warn[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.forms_cpu_utilization_high_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.forms_memory_utilization_high_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.reliability_dead_letter_queue_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.response_time_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.route53_ddos[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.twoFa_verification_exceeded"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.vault_data_integrity_check_lambda_iterator_age"]
WARN - plan.json - main - Missing Common Tags: ["aws_iam_role.notify_slack_lambda"]
WARN - plan.json -...
LGTM!