fix: add HealthyHostCount alarms to App, IdP, API #818
Merged
Update the App, IdP and API unhealthy host alarms to only trigger warnings that post to Slack. Add healthy host alarms that trigger SEV1 OpsGenie responses when a service has no healthy hosts. This will currently only trigger an OpsGenie page for the App load balancer target groups.
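As a rough illustration of the split described above, the two kinds of alarm could be sketched in Terraform as follows. This is not the PR's actual code: the resource names, the `var.*` inputs, the periods, the statistics and the missing-data handling are all assumptions made for the example. Only the metric names (`UnHealthyHostCount`, `HealthyHostCount`) and the Slack warning versus OpsGenie SEV1 routing come from the description.

```hcl
# Hypothetical inputs: SNS topics assumed to forward to Slack (warning)
# and OpsGenie (SEV1), plus the ALB and target group identifiers.
variable "sns_topic_warning_arn" { type = string }
variable "sns_topic_critical_arn" { type = string }
variable "lb_arn_suffix" { type = string }
variable "target_group_arn_suffix" { type = string }

# Warning only: at least one target is unhealthy, but the service may still be up.
resource "aws_cloudwatch_metric_alarm" "example_unhealthy_host_count_warn" {
  alarm_name          = "Example-UnHealthyHostCount-Warn"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "UnHealthyHostCount"
  statistic           = "Maximum"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1
  period              = 60
  evaluation_periods  = 1
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_warning_arn]

  dimensions = {
    LoadBalancer = var.lb_arn_suffix
    TargetGroup  = var.target_group_arn_suffix
  }
}

# SEV1: the target group has no healthy hosts at all.
resource "aws_cloudwatch_metric_alarm" "example_healthy_host_count_sev1" {
  alarm_name          = "Example-HealthyHostCount-SEV1"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HealthyHostCount"
  statistic           = "Maximum"
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  period              = 60
  evaluation_periods  = 1
  # Assumption: treat a missing metric as an outage rather than as healthy.
  treat_missing_data  = "breaching"
  alarm_actions       = [var.sns_topic_critical_arn]

  dimensions = {
    LoadBalancer = var.lb_arn_suffix
    TargetGroup  = var.target_group_arn_suffix
  }
}
```

Under these assumptions a single failing task only produces a Slack warning, while a page is raised only when the `Maximum` of `HealthyHostCount` over the period drops below 1.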
⚠ Terraform update available
Terraform: 1.9.5 (using 1.9.2)
Terragrunt: 0.67.4 (using 0.63.2)
Staging: alarms
✅ Terraform Init
Plan: 6 to add, 3 to change, 2 to destroy
✂ Warning: plan has been truncated! See the full plan in the logs.
Resource actions are indicated with the following symbols:
+ create
~ update in-place
-/+ destroy and then create replacement
Terraform will perform the following actions:
# aws_cloudwatch_dashboard.forms_service_health will be updated in-place
~ resource "aws_cloudwatch_dashboard" "forms_service_health" {
~ dashboard_body = jsonencode(
{
- widgets = [
- {
- height = 8
- properties = {
- metrics = [
- [
- "AWS/RDS",
- "CPUUtilization",
- "DBClusterIdentifier",
- "forms-staging-db-cluster",
- {
- color = "#17becf"
- region = "ca-central-1"
},
],
]
- period = 60
- region = "ca-central-1"
- sparkline = true
- stacked = false
- stat = "Average"
- title = "DB: CPU use"
- view = "timeSeries"
}
- type = "metric"
- width = 6
- x = 0
- y = 111
},
- {
- height = 8
- properties = {
- metrics = [
- [
- "AWS/RDS",
- "FreeableMemory",
- "DBClusterIdentifier",
- "forms-staging-db-cluster",
- {
- color = "#9467bd"
},
],
]
- period = 60
- region = "ca-central-1"
- sparkline = true
- stacked = false
- stat = "Average"
- title = "DB: freeable memory"
- view = "timeSeries"
}
- type = "metric"
- width = 6
- x = 6
- y = 111
},
- {
- height = 8
- properties = {
- metrics = [
- [
- "AWS/RDS",
- "ReadLatency",
- "DBClusterIdentifier",
- "forms-staging-db-cluster",
- {
- color = "#c5b0d5"
},
],
]
- period = 60
- region = "ca-central-1"
- sparkline = true
- stacked = false
- stat = "Average"
- title = "DB: read latency"
- view = "timeSeries"
}
- type = "metric"
- width = 6
- x = 12
- y = 111
},
- {
- height = 8
- properties = {
- metrics = [
- [
- "AWS/RDS",
- "WriteLatency",
- "DBClusterIdentifier",
- "forms-staging-db-cluster",
- {
- color = "#7f7f7f"
},
],
]
- period = 60
- region = "ca-central-1"
- sparkline = true
- stacked = false
- stat = "Average"
- title = "DB: write latency"
- view = "timeSeries"
}
- type = "metric"
- width = 6
- x = 18
- y = 111
},
- {
- height = 2
- properties = {
- background = "transparent"
- markdown = <<-EOT
# Form submissions
Tracking form submissions flow through the system.
EOT
}
- type = "text"
- width = 24
- x = 0
- y = 0
},
- {
- height = 6
- properties = {
- metrics = [
- [
- "AWS/SQS",
- "NumberOfMessagesReceived",
- "QueueName",
- "submission_processing.fifo",
- {
- color = "#8c564b"
},
],
]
- period = 300
- region = "ca-central-1"
- stacked = false
- stat = "Sum"
- title = "Queue: submission messages"
- view = "timeSeries"
}
- type = "metric"
- width = 8
- x = 0
- y = 14
},
- {
- height = 6
- properties = {
- metrics = [
- [
- "AWS/SQS",
- "ApproximateAgeOfOldestMessage",
- "QueueName",
- "submission_processing.fifo",
- {
- color = "#7f7f7f"
- label = "Oldest message age"
- region = "ca-central-1"
},
],
]
- period = 300
- region = "ca-central-1"
- sparkline = true
- stat = "Average"
- title = "Queue: submission message age"
- view = "singleValue"
}
- type = "metric"
- width = 4
- x = 8
- y = 14
},
- {
- height = 3
- properties = {
- background = "transparent"
- markdown = <<-EOT
# Form responses
Tracking form response list, retrieval and confirm.
EOT
}
- type = "text"
- width = 24
- x = 0
- y = 20
},
- {
- height = 3
- properties = {
- background = "transparent"
- markdown = <<-EOT
## Lambdas
Performance metrics for the Submission and Reliability functions.
EOT
}
- type = "text"
- width = 24
- x = 0
- y = 83
},
- {
- height = 7
- properties = {
- metrics = [
- [
- "AWS/ECS",
- "CPUUtilization",
- "ServiceName",
- "form-viewer",
- "ClusterName",
- "Forms",
- {
- region = "ca-central-1"
- stat = "Minimum"
},
],
- [
- "...",
- {
- region = "ca-central-1"
- stat = "Maximum"
},
],
- [
- "...",
- {
- region = "ca-central-1"
- stat = "Average"
},
],
]
- period = 300
- region = "ca-central-1"
- stacked = false
- title = "App: CPU use"
- view = "timeSeries"
}
- type = "metric"
- width = 8
- x = 0
- y = 76
},
- {
- height = 7
- properties = {
- metrics = [
- [
- "AWS/ECS",
- "MemoryUtilization",
- "ServiceName",
- "form-viewer",
- "ClusterName",
- "Forms",
- {
- stat = "Minimum"
},
],
- [
- "...",
- {
- stat = "Maximum"
},
],
- [
- "...",
- {
- stat = "Average"
},
],
]
- period = 300
- region = "ca-central-1"
- stacked = false
- title = "App: memory use"
- view = "timeSeries"
}
- type = "metric"
- width = 8
- x = 8
- y = 76
},
- {
- height = 3
- properties = {
- background = "transparent"
- markdown = <<-EOT
## Load balancer
Requests, errors and response time for the app's load balancer.
EOT
}
- type = "text"
- width = 24
- x = 0
- y = 98
},
- {
- height = 6
- properties = {
- metrics = [
- [
- "AWS/Lambda",
- "Invocations",
- "FunctionName",
- "Submission",
- {
- region = "ca-central-1"
},
],
- [
- ".",
- "Throttles",
- ".",
- ".",
- {
- color = "#ffbb78"
- region = "ca-central-1"
},
],
- [
- ".",
- "Errors",
- ".",
- ".",
- {
- color = "#d62728"
- region = "ca-central-1"
},
],
]
- period = 300
- region = "ca-central-1"
- stacked = false
- stat = "Sum"
- title = "Lambda: submission"
- view = "timeSeries"
}
- type = "metric"
- width = 18
- x = 0
- y = 86
},
- {
- height = 6
- properties = {
- metrics = [
- [
- "AWS/Lambda",
- "Duration",
- "FunctionName",
- "Submission",
- "Resource",
- "Submission",
- {
- color = "#555555"
- region = "ca-central-1"
},
],
]
- period = 300
- region = "ca-central-1"
- sparkline = true
- stacked = false
- stat = "Average"
- title = "Lambda: submission duration"
- view = "singleValue"
}
- type = "metric"
- width = 6
- x = 18
- y = 86
},
- {
- height = 6
- properties = {
- metrics = [
- [
- "AWS/Lambda",
- "Invocations",
- "FunctionName",
- "reliability",
- "Resource",
- "reliability",
- {
- region = "ca-central-1"
},
],
- [
- ".",
- "Throttles",
- ".",
- ".",
- ".",
- ".",
- {
- color = "#ffbb78"
- region = "ca-central-1"
},
],
- [
- ".",
- "Errors",
- ".",
- ".",
- ".",
- ".",
- {
- color = "#d62728"
- region = "ca-central-1"
},
],
]
- period = 300
- region = "ca-central-1"
- stacked = false
- stat = "Sum"
- title = "Lambda: reliability"
- view = "timeSeries"
}
- type = "metric"
- width = 18
- x = 0
- y = 92
},
- {
- height = 6
- properties = {
- metrics = [
- [
- "AWS/Lambda",
- "Duration",
- "FunctionName",
- "reliability",
- "Resource",
- "reliability",
- {
- color = "#555"
},
],
]
- period = 300
- region = "ca-central-1"
- sparkline = true
- stacked = false
- stat = "Average"
- title = "Lambda: reliabiity duration"
- view = "singleValue"
}
- type = "metric"
- width = 6
- x = 18
- y = 92
},
- {
- height = 7
- properties = {
- metrics = [
- [
- "ECS/ContainerInsights",
- "NetworkRxBytes",
- "ClusterName",
- "Forms",
- {
- color = "#1f77b4"
- region = "ca-central-1"
},
],
]
- period = 300
- region = "ca-central-1"
- stacked = false
- stat = "Sum"
- title = "App: network bytes"
- view = "timeSeries"
}
- type = "metric"
- width = 8
- x = 16
- y = 76
},
- {
- height = 7
- properties = {
- metrics = [
- [
- "AWS/ApplicationELB",
- "RequestCount",
- "LoadBalancer",
- "app/form-viewer/5e6bc2d9ab810b68",
- {
- color = "#2ca02c"
- label = "Request count"
- region = "ca-central-1"
},
],
- [
- ".",
- "HTTPCode_ELB_4XX_Count",
- ".",
- ".",
- {
- color = "#ffbb78"
- label = "4XX response count"
- region = "ca-central-1"
},
],
- [
- ".",
- "HTTPCode_ELB_5XX_Count",
- ".",
- ".",
- {
- color = "#d62728"
- label = "5XX response count"
- region = "ca-central-1"
},
],
]
- period = 300
- region = "ca-central-1"
- stacked = false
- stat = "Sum"
- title = "LB: requests"
- view = "timeSeries"
}
- type = "metric"
- width = 9
- x = 0
- y = 101
},
- {
- height = 7
- properties = {
- metrics = [
- [
- "AWS/ApplicationELB",
- "TargetResponseTime",
- "LoadBalancer",
- "app/form-viewer/5e6bc2d9ab810b68",
- {
- color = "#8c564b"
- region = "ca-central-1"
},
],
]
- period = 300
- region = "ca-central-1"
- sparkline = true
- stacked = false
- stat = "Average"
- title = "LB: response time"
- view = "singleValue"
}
- type = "metric"
- width = 6
- x = 18
- y = 101
},
- {
- height = 3
- properties = {
- background = "transparent"
- markdown = <<-EOT
## Database
Performance metrics for the database cluster.
EOT
}
- type = "text"
- width = 24
- x = 0
- y = 108
},
- {
- height = 7
- properties = {
- metrics = [
- [
- "AWS/ApplicationELB",
- "ActiveConnectionCount",
- "LoadBalancer",
- "app/form-viewer/5e6bc2d9ab810b68",
- {
- color = "#e377c2"
},
],
]
- period = 300
- region = "ca-central-1"
- stacked = false
- stat = "Average"
- title = "LB: connections"
- view = "timeSeries"
}
- type = "metric"
- width = 9
- x = 9
- y = 101
},
- {
- height = 8
- properties = {
- query = <<-EOT
SOURCE 'Forms' | SOURCE '/aws/lambda/Reliability' | SOURCE '/aws/lambda/Submission' | SOURCE '/aws/lambda/Nagware' | SOURCE '/aws/lambda/Response_Archiver' | SOURCE '/aws/lambda/Vault_Data_Integrity_Check' | fields @timestamp, @message, @logStream, @log
| filter level = 'error' or level = 'warn' or status = 'failed'
| filter @message not like /days since submission/
| sort @timestamp desc
| limit 1000
EOT
- region = "ca-central-1"
- stacked = false
- title = "Errors: app and lambdas"
- view = "table"
}
- type = "log"
- width = 20
- x = 0
- y = 64
},
- {
- height = 8
- properties = {
- alarms = [
- "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:CpuUtilizationWarn",
- "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:MemoryUtilizationWarn",
- "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:HTTPCode_ELB_5XX_Count",
- "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:ResponseTimeWarn",
- "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:UnHealthyHostCount-TargetGroup1-SEV1",
- "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:UnHealthyHostCount-TargetGroup2-SEV1",
- "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:ReliabilityDeadLetterQueueWarn",
]
- title = "Alarms"
}
- type = "alarm"
- width = 4
- x = 20
- y = 64
},
- {
- height = 2
- properties = {
- background = "transparent"
- markdown = <<-EOT
# Performance
EOT
}
- type = "text"
- width = 24
- x = 0
- y = 72
},
- {
- height = 7
- properties = {
- query = <<-EOT
SOURCE 'Forms' | fields @message
| filter @message =~ /HealthCheck: cognito sign-up/
| parse @message "success" as @successCount
| parse @message "failure" as @failureCount
| stats count(@successCount) as Success, count(@failureCount) as Failed by bin(5m)
EOT
- region = "ca-central-1"
-...
Conftest results
WARN - plan.json - main - Missing Common Tags: ["aws_athena_data_catalog.dynamodb"]
WARN - plan.json - main - Missing Common Tags: ["aws_athena_data_catalog.rds_data_catalog"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_event_rule.codedeploy_sns"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_log_group.notify_slack"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ELB_5xx_error_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ELB_healthy_hosts"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.UnHealthyHostCount-TargetGroup1"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.UnHealthyHostCount-TargetGroup2"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.alb_ddos"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.api_cpu_utilization_high_warn[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.api_lb_healthy_host_count[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.api_lb_unhealthy_host_count[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.api_memory_utilization_high_warn[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.api_response_time_warn[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.audit_log_dead_letter_queue_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.cognito_login_outside_canada_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.cognito_signin_exceeded"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ddos_detected_forms_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ddos_detected_route53_warn[0]"]
WARN - plan.json - main - Missing Common Tags:...
bryan-robitaille approved these changes on Sep 10, 2024