fix: add HealthyHostCount alarms to App, IdP, API #818

Merged 1 commit on Sep 10, 2024

Conversation

patheard
Member

Summary

Update the App, IdP and API unhealthy host alarms to only trigger warnings that post to Slack.

Add healthy host alarms that trigger SEV1 OpsGenie responses when a service has no healthy hosts. This will currently only trigger an OpsGenie page for the App load balancer target groups.
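
The split described above, a warning when any host is unhealthy and a page when no host is healthy, can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: the resource names, alarm names, variable names, periods and the choice of `Maximum` as the statistic are all assumptions. Only the `AWS/ApplicationELB` namespace, the `HealthyHostCount` and `UnHealthyHostCount` metrics, and the `aws_cloudwatch_metric_alarm` arguments are standard.

```hcl
# Hypothetical inputs; names are illustrative, not taken from the repository.
variable "lb_arn_suffix" { type = string }
variable "lb_target_group_arn_suffix" { type = string }
variable "sns_topic_alert_warning_arn" { type = string }  # assumed to post to Slack
variable "sns_topic_alert_critical_arn" { type = string } # assumed to page OpsGenie

# Warning only: at least one target is failing its health checks.
resource "aws_cloudwatch_metric_alarm" "example_unhealthy_host_count" {
  alarm_name          = "Example-UnHealthyHostCount-Warn"
  alarm_description   = "One or more unhealthy hosts in the target group"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "UnHealthyHostCount"
  statistic           = "Maximum"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1
  period              = 60
  evaluation_periods  = 1
  alarm_actions       = [var.sns_topic_alert_warning_arn]

  dimensions = {
    LoadBalancer = var.lb_arn_suffix
    TargetGroup  = var.lb_target_group_arn_suffix
  }
}

# SEV1: the target group has no healthy hosts left to serve traffic.
resource "aws_cloudwatch_metric_alarm" "example_healthy_host_count" {
  alarm_name          = "Example-HealthyHostCount-SEV1"
  alarm_description   = "No healthy hosts in the target group"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HealthyHostCount"
  statistic           = "Maximum"
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  period              = 60
  evaluation_periods  = 1
  alarm_actions       = [var.sns_topic_alert_critical_arn]

  dimensions = {
    LoadBalancer = var.lb_arn_suffix
    TargetGroup  = var.lb_target_group_arn_suffix
  }
}
```

Alarming on `HealthyHostCount < 1` rather than on `UnHealthyHostCount` keeps the page tied to an actual loss of capacity, so a single failing task during a deployment would only produce the Slack warning.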

Related

@patheard patheard self-assigned this Sep 10, 2024

⚠ Terraform update available

Terraform: 1.9.5 (using 1.9.2)
Terragrunt: 0.67.4 (using 0.63.2)


Staging: alarms

✅   Terraform Init: success
✅   Terraform Validate: success
✅   Terraform Format: success
✅   Terraform Plan: success
✅   Conftest: success

⚠️   Warning: resources will be destroyed by this change!

Plan: 6 to add, 3 to change, 2 to destroy
| CHANGE | NAME |
| --- | --- |
| add | aws_cloudwatch_metric_alarm.ELB_healthy_hosts |
| add | aws_cloudwatch_metric_alarm.api_lb_healthy_host_count[0] |
| add | aws_cloudwatch_metric_alarm.idb_lb_healthy_host_count["HTTP1"] |
| add | aws_cloudwatch_metric_alarm.idb_lb_healthy_host_count["HTTP2"] |
| update | aws_cloudwatch_dashboard.forms_service_health |
| update | aws_cloudwatch_metric_alarm.idb_lb_unhealthy_host_count["HTTP1"] |
| update | aws_cloudwatch_metric_alarm.idb_lb_unhealthy_host_count["HTTP2"] |
| recreate | aws_cloudwatch_metric_alarm.UnHealthyHostCount-TargetGroup1 |
| recreate | aws_cloudwatch_metric_alarm.UnHealthyHostCount-TargetGroup2 |
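
The indexed addresses in the summary (`["HTTP1"]`, `["HTTP2"]` and `[0]`) are what Terraform produces for resources declared with `for_each` and `count`. A rough sketch of how such resources might be declared is below. The resource names match the plan, but the variables, alarm names and settings are hypothetical and not taken from the repository.

```hcl
# Hypothetical inputs; names are illustrative, not taken from the repository.
variable "idp_lb_arn_suffix" { type = string }
variable "api_lb_arn_suffix" { type = string }
variable "api_target_group_arn_suffix" { type = string }
variable "enable_api" { type = bool }
variable "sns_topic_alert_critical_arn" { type = string }

# Assumed to be a map keyed by protocol version, e.g. keys "HTTP1" and "HTTP2".
variable "idp_target_group_arn_suffixes" { type = map(string) }

# One alarm per map entry, addressed as idb_lb_healthy_host_count["HTTP1"], etc.
resource "aws_cloudwatch_metric_alarm" "idb_lb_healthy_host_count" {
  for_each = var.idp_target_group_arn_suffixes

  alarm_name          = "Example-IdP-HealthyHostCount-${each.key}"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HealthyHostCount"
  statistic           = "Maximum"
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  period              = 60
  evaluation_periods  = 1
  alarm_actions       = [var.sns_topic_alert_critical_arn]

  dimensions = {
    LoadBalancer = var.idp_lb_arn_suffix
    TargetGroup  = each.value
  }
}

# Created only when the flag is set, addressed as api_lb_healthy_host_count[0].
resource "aws_cloudwatch_metric_alarm" "api_lb_healthy_host_count" {
  count = var.enable_api ? 1 : 0

  alarm_name          = "Example-API-HealthyHostCount"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HealthyHostCount"
  statistic           = "Maximum"
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  period              = 60
  evaluation_periods  = 1
  alarm_actions       = [var.sns_topic_alert_critical_arn]

  dimensions = {
    LoadBalancer = var.api_lb_arn_suffix
    TargetGroup  = var.api_target_group_arn_suffix
  }
}
```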

✂   Warning: plan has been truncated! See the full plan in the logs.

Resource actions are indicated with the following symbols:
  + create
  ~ update in-place
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # aws_cloudwatch_dashboard.forms_service_health will be updated in-place
  ~ resource "aws_cloudwatch_dashboard" "forms_service_health" {
      ~ dashboard_body = jsonencode(
            {
              - widgets = [
                  - {
                      - height     = 8
                      - properties = {
                          - metrics   = [
                              - [
                                  - "AWS/RDS",
                                  - "CPUUtilization",
                                  - "DBClusterIdentifier",
                                  - "forms-staging-db-cluster",
                                  - {
                                      - color  = "#17becf"
                                      - region = "ca-central-1"
                                    },
                                ],
                            ]
                          - period    = 60
                          - region    = "ca-central-1"
                          - sparkline = true
                          - stacked   = false
                          - stat      = "Average"
                          - title     = "DB: CPU use"
                          - view      = "timeSeries"
                        }
                      - type       = "metric"
                      - width      = 6
                      - x          = 0
                      - y          = 111
                    },
                  - {
                      - height     = 8
                      - properties = {
                          - metrics   = [
                              - [
                                  - "AWS/RDS",
                                  - "FreeableMemory",
                                  - "DBClusterIdentifier",
                                  - "forms-staging-db-cluster",
                                  - {
                                      - color = "#9467bd"
                                    },
                                ],
                            ]
                          - period    = 60
                          - region    = "ca-central-1"
                          - sparkline = true
                          - stacked   = false
                          - stat      = "Average"
                          - title     = "DB: freeable memory"
                          - view      = "timeSeries"
                        }
                      - type       = "metric"
                      - width      = 6
                      - x          = 6
                      - y          = 111
                    },
                  - {
                      - height     = 8
                      - properties = {
                          - metrics   = [
                              - [
                                  - "AWS/RDS",
                                  - "ReadLatency",
                                  - "DBClusterIdentifier",
                                  - "forms-staging-db-cluster",
                                  - {
                                      - color = "#c5b0d5"
                                    },
                                ],
                            ]
                          - period    = 60
                          - region    = "ca-central-1"
                          - sparkline = true
                          - stacked   = false
                          - stat      = "Average"
                          - title     = "DB: read latency"
                          - view      = "timeSeries"
                        }
                      - type       = "metric"
                      - width      = 6
                      - x          = 12
                      - y          = 111
                    },
                  - {
                      - height     = 8
                      - properties = {
                          - metrics   = [
                              - [
                                  - "AWS/RDS",
                                  - "WriteLatency",
                                  - "DBClusterIdentifier",
                                  - "forms-staging-db-cluster",
                                  - {
                                      - color = "#7f7f7f"
                                    },
                                ],
                            ]
                          - period    = 60
                          - region    = "ca-central-1"
                          - sparkline = true
                          - stacked   = false
                          - stat      = "Average"
                          - title     = "DB: write latency"
                          - view      = "timeSeries"
                        }
                      - type       = "metric"
                      - width      = 6
                      - x          = 18
                      - y          = 111
                    },
                  - {
                      - height     = 2
                      - properties = {
                          - background = "transparent"
                          - markdown   = <<-EOT
                                # Form submissions
                                Tracking form submissions flow through the system.
                            EOT
                        }
                      - type       = "text"
                      - width      = 24
                      - x          = 0
                      - y          = 0
                    },
                  - {
                      - height     = 6
                      - properties = {
                          - metrics = [
                              - [
                                  - "AWS/SQS",
                                  - "NumberOfMessagesReceived",
                                  - "QueueName",
                                  - "submission_processing.fifo",
                                  - {
                                      - color = "#8c564b"
                                    },
                                ],
                            ]
                          - period  = 300
                          - region  = "ca-central-1"
                          - stacked = false
                          - stat    = "Sum"
                          - title   = "Queue: submission messages"
                          - view    = "timeSeries"
                        }
                      - type       = "metric"
                      - width      = 8
                      - x          = 0
                      - y          = 14
                    },
                  - {
                      - height     = 6
                      - properties = {
                          - metrics   = [
                              - [
                                  - "AWS/SQS",
                                  - "ApproximateAgeOfOldestMessage",
                                  - "QueueName",
                                  - "submission_processing.fifo",
                                  - {
                                      - color  = "#7f7f7f"
                                      - label  = "Oldest message age"
                                      - region = "ca-central-1"
                                    },
                                ],
                            ]
                          - period    = 300
                          - region    = "ca-central-1"
                          - sparkline = true
                          - stat      = "Average"
                          - title     = "Queue: submission message age"
                          - view      = "singleValue"
                        }
                      - type       = "metric"
                      - width      = 4
                      - x          = 8
                      - y          = 14
                    },
                  - {
                      - height     = 3
                      - properties = {
                          - background = "transparent"
                          - markdown   = <<-EOT
                                # Form responses
                                Tracking form response list, retrieval and confirm.
                            EOT
                        }
                      - type       = "text"
                      - width      = 24
                      - x          = 0
                      - y          = 20
                    },
                  - {
                      - height     = 3
                      - properties = {
                          - background = "transparent"
                          - markdown   = <<-EOT
                                ## Lambdas
                                Performance metrics for the Submission and Reliability functions.
                            EOT
                        }
                      - type       = "text"
                      - width      = 24
                      - x          = 0
                      - y          = 83
                    },
                  - {
                      - height     = 7
                      - properties = {
                          - metrics = [
                              - [
                                  - "AWS/ECS",
                                  - "CPUUtilization",
                                  - "ServiceName",
                                  - "form-viewer",
                                  - "ClusterName",
                                  - "Forms",
                                  - {
                                      - region = "ca-central-1"
                                      - stat   = "Minimum"
                                    },
                                ],
                              - [
                                  - "...",
                                  - {
                                      - region = "ca-central-1"
                                      - stat   = "Maximum"
                                    },
                                ],
                              - [
                                  - "...",
                                  - {
                                      - region = "ca-central-1"
                                      - stat   = "Average"
                                    },
                                ],
                            ]
                          - period  = 300
                          - region  = "ca-central-1"
                          - stacked = false
                          - title   = "App: CPU use"
                          - view    = "timeSeries"
                        }
                      - type       = "metric"
                      - width      = 8
                      - x          = 0
                      - y          = 76
                    },
                  - {
                      - height     = 7
                      - properties = {
                          - metrics = [
                              - [
                                  - "AWS/ECS",
                                  - "MemoryUtilization",
                                  - "ServiceName",
                                  - "form-viewer",
                                  - "ClusterName",
                                  - "Forms",
                                  - {
                                      - stat = "Minimum"
                                    },
                                ],
                              - [
                                  - "...",
                                  - {
                                      - stat = "Maximum"
                                    },
                                ],
                              - [
                                  - "...",
                                  - {
                                      - stat = "Average"
                                    },
                                ],
                            ]
                          - period  = 300
                          - region  = "ca-central-1"
                          - stacked = false
                          - title   = "App: memory use"
                          - view    = "timeSeries"
                        }
                      - type       = "metric"
                      - width      = 8
                      - x          = 8
                      - y          = 76
                    },
                  - {
                      - height     = 3
                      - properties = {
                          - background = "transparent"
                          - markdown   = <<-EOT
                                ## Load balancer
                                Requests, errors and response time for the app's load balancer.
                            EOT
                        }
                      - type       = "text"
                      - width      = 24
                      - x          = 0
                      - y          = 98
                    },
                  - {
                      - height     = 6
                      - properties = {
                          - metrics = [
                              - [
                                  - "AWS/Lambda",
                                  - "Invocations",
                                  - "FunctionName",
                                  - "Submission",
                                  - {
                                      - region = "ca-central-1"
                                    },
                                ],
                              - [
                                  - ".",
                                  - "Throttles",
                                  - ".",
                                  - ".",
                                  - {
                                      - color  = "#ffbb78"
                                      - region = "ca-central-1"
                                    },
                                ],
                              - [
                                  - ".",
                                  - "Errors",
                                  - ".",
                                  - ".",
                                  - {
                                      - color  = "#d62728"
                                      - region = "ca-central-1"
                                    },
                                ],
                            ]
                          - period  = 300
                          - region  = "ca-central-1"
                          - stacked = false
                          - stat    = "Sum"
                          - title   = "Lambda: submission"
                          - view    = "timeSeries"
                        }
                      - type       = "metric"
                      - width      = 18
                      - x          = 0
                      - y          = 86
                    },
                  - {
                      - height     = 6
                      - properties = {
                          - metrics   = [
                              - [
                                  - "AWS/Lambda",
                                  - "Duration",
                                  - "FunctionName",
                                  - "Submission",
                                  - "Resource",
                                  - "Submission",
                                  - {
                                      - color  = "#555555"
                                      - region = "ca-central-1"
                                    },
                                ],
                            ]
                          - period    = 300
                          - region    = "ca-central-1"
                          - sparkline = true
                          - stacked   = false
                          - stat      = "Average"
                          - title     = "Lambda: submission duration"
                          - view      = "singleValue"
                        }
                      - type       = "metric"
                      - width      = 6
                      - x          = 18
                      - y          = 86
                    },
                  - {
                      - height     = 6
                      - properties = {
                          - metrics = [
                              - [
                                  - "AWS/Lambda",
                                  - "Invocations",
                                  - "FunctionName",
                                  - "reliability",
                                  - "Resource",
                                  - "reliability",
                                  - {
                                      - region = "ca-central-1"
                                    },
                                ],
                              - [
                                  - ".",
                                  - "Throttles",
                                  - ".",
                                  - ".",
                                  - ".",
                                  - ".",
                                  - {
                                      - color  = "#ffbb78"
                                      - region = "ca-central-1"
                                    },
                                ],
                              - [
                                  - ".",
                                  - "Errors",
                                  - ".",
                                  - ".",
                                  - ".",
                                  - ".",
                                  - {
                                      - color  = "#d62728"
                                      - region = "ca-central-1"
                                    },
                                ],
                            ]
                          - period  = 300
                          - region  = "ca-central-1"
                          - stacked = false
                          - stat    = "Sum"
                          - title   = "Lambda: reliability"
                          - view    = "timeSeries"
                        }
                      - type       = "metric"
                      - width      = 18
                      - x          = 0
                      - y          = 92
                    },
                  - {
                      - height     = 6
                      - properties = {
                          - metrics   = [
                              - [
                                  - "AWS/Lambda",
                                  - "Duration",
                                  - "FunctionName",
                                  - "reliability",
                                  - "Resource",
                                  - "reliability",
                                  - {
                                      - color = "#555"
                                    },
                                ],
                            ]
                          - period    = 300
                          - region    = "ca-central-1"
                          - sparkline = true
                          - stacked   = false
                          - stat      = "Average"
                          - title     = "Lambda: reliabiity duration"
                          - view      = "singleValue"
                        }
                      - type       = "metric"
                      - width      = 6
                      - x          = 18
                      - y          = 92
                    },
                  - {
                      - height     = 7
                      - properties = {
                          - metrics = [
                              - [
                                  - "ECS/ContainerInsights",
                                  - "NetworkRxBytes",
                                  - "ClusterName",
                                  - "Forms",
                                  - {
                                      - color  = "#1f77b4"
                                      - region = "ca-central-1"
                                    },
                                ],
                            ]
                          - period  = 300
                          - region  = "ca-central-1"
                          - stacked = false
                          - stat    = "Sum"
                          - title   = "App: network bytes"
                          - view    = "timeSeries"
                        }
                      - type       = "metric"
                      - width      = 8
                      - x          = 16
                      - y          = 76
                    },
                  - {
                      - height     = 7
                      - properties = {
                          - metrics = [
                              - [
                                  - "AWS/ApplicationELB",
                                  - "RequestCount",
                                  - "LoadBalancer",
                                  - "app/form-viewer/5e6bc2d9ab810b68",
                                  - {
                                      - color  = "#2ca02c"
                                      - label  = "Request count"
                                      - region = "ca-central-1"
                                    },
                                ],
                              - [
                                  - ".",
                                  - "HTTPCode_ELB_4XX_Count",
                                  - ".",
                                  - ".",
                                  - {
                                      - color  = "#ffbb78"
                                      - label  = "4XX response count"
                                      - region = "ca-central-1"
                                    },
                                ],
                              - [
                                  - ".",
                                  - "HTTPCode_ELB_5XX_Count",
                                  - ".",
                                  - ".",
                                  - {
                                      - color  = "#d62728"
                                      - label  = "5XX response count"
                                      - region = "ca-central-1"
                                    },
                                ],
                            ]
                          - period  = 300
                          - region  = "ca-central-1"
                          - stacked = false
                          - stat    = "Sum"
                          - title   = "LB: requests"
                          - view    = "timeSeries"
                        }
                      - type       = "metric"
                      - width      = 9
                      - x          = 0
                      - y          = 101
                    },
                  - {
                      - height     = 7
                      - properties = {
                          - metrics   = [
                              - [
                                  - "AWS/ApplicationELB",
                                  - "TargetResponseTime",
                                  - "LoadBalancer",
                                  - "app/form-viewer/5e6bc2d9ab810b68",
                                  - {
                                      - color  = "#8c564b"
                                      - region = "ca-central-1"
                                    },
                                ],
                            ]
                          - period    = 300
                          - region    = "ca-central-1"
                          - sparkline = true
                          - stacked   = false
                          - stat      = "Average"
                          - title     = "LB: response time"
                          - view      = "singleValue"
                        }
                      - type       = "metric"
                      - width      = 6
                      - x          = 18
                      - y          = 101
                    },
                  - {
                      - height     = 3
                      - properties = {
                          - background = "transparent"
                          - markdown   = <<-EOT
                                ## Database
                                Performance metrics for the database cluster.
                            EOT
                        }
                      - type       = "text"
                      - width      = 24
                      - x          = 0
                      - y          = 108
                    },
                  - {
                      - height     = 7
                      - properties = {
                          - metrics = [
                              - [
                                  - "AWS/ApplicationELB",
                                  - "ActiveConnectionCount",
                                  - "LoadBalancer",
                                  - "app/form-viewer/5e6bc2d9ab810b68",
                                  - {
                                      - color = "#e377c2"
                                    },
                                ],
                            ]
                          - period  = 300
                          - region  = "ca-central-1"
                          - stacked = false
                          - stat    = "Average"
                          - title   = "LB: connections"
                          - view    = "timeSeries"
                        }
                      - type       = "metric"
                      - width      = 9
                      - x          = 9
                      - y          = 101
                    },
                  - {
                      - height     = 8
                      - properties = {
                          - query   = <<-EOT
                                SOURCE 'Forms' | SOURCE '/aws/lambda/Reliability' | SOURCE '/aws/lambda/Submission' | SOURCE '/aws/lambda/Nagware' | SOURCE '/aws/lambda/Response_Archiver' | SOURCE '/aws/lambda/Vault_Data_Integrity_Check' | fields @timestamp, @message, @logStream, @log
                                | filter level = 'error' or level = 'warn' or status = 'failed'
                                | filter @message not like /days since submission/
                                | sort @timestamp desc
                                | limit 1000
                            EOT
                          - region  = "ca-central-1"
                          - stacked = false
                          - title   = "Errors: app and lambdas"
                          - view    = "table"
                        }
                      - type       = "log"
                      - width      = 20
                      - x          = 0
                      - y          = 64
                    },
                  - {
                      - height     = 8
                      - properties = {
                          - alarms = [
                              - "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:CpuUtilizationWarn",
                              - "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:MemoryUtilizationWarn",
                              - "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:HTTPCode_ELB_5XX_Count",
                              - "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:ResponseTimeWarn",
                              - "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:UnHealthyHostCount-TargetGroup1-SEV1",
                              - "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:UnHealthyHostCount-TargetGroup2-SEV1",
                              - "arn:aws:cloudwatch:ca-central-1:687401027353:alarm:ReliabilityDeadLetterQueueWarn",
                            ]
                          - title  = "Alarms"
                        }
                      - type       = "alarm"
                      - width      = 4
                      - x          = 20
                      - y          = 64
                    },
                  - {
                      - height     = 2
                      - properties = {
                          - background = "transparent"
                          - markdown   = <<-EOT
                                # Performance
                            EOT
                        }
                      - type       = "text"
                      - width      = 24
                      - x          = 0
                      - y          = 72
                    },
                  - {
                      - height     = 7
                      - properties = {
                          - query   = <<-EOT
                                SOURCE 'Forms' | fields @message
                                | filter @message =~ /HealthCheck: cognito sign-up/
                                | parse @message "success" as @successCount
                                | parse @message "failure" as @failureCount
                                | stats count(@successCount) as Success, count(@failureCount) as Failed by bin(5m)
                            EOT
                          - region  = "ca-central-1"
                          -...
Conftest results
WARN - plan.json - main - Missing Common Tags: ["aws_athena_data_catalog.dynamodb"]
WARN - plan.json - main - Missing Common Tags: ["aws_athena_data_catalog.rds_data_catalog"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_event_rule.codedeploy_sns"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_log_group.notify_slack"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ELB_5xx_error_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ELB_healthy_hosts"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.UnHealthyHostCount-TargetGroup1"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.UnHealthyHostCount-TargetGroup2"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.alb_ddos"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.api_cpu_utilization_high_warn[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.api_lb_healthy_host_count[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.api_lb_unhealthy_host_count[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.api_memory_utilization_high_warn[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.api_response_time_warn[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.audit_log_dead_letter_queue_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.cognito_login_outside_canada_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.cognito_signin_exceeded"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ddos_detected_forms_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ddos_detected_route53_warn[0]"]
WARN - plan.json - main - Missing Common Tags:...

@patheard patheard marked this pull request as ready for review September 10, 2024 18:31
@patheard patheard merged commit 0e2301d into develop Sep 10, 2024
11 checks passed
@patheard patheard deleted the fix/healthy-host-count-alarms branch September 10, 2024 19:03