feat: add system health dashboard #688

Merged
merged 4 commits into from
Jun 14, 2024

patheard
Member

Add a CloudWatch dashboard that displays the system health metrics, logs and alarms. The goal is to make it easier for on-callers to identify where issues are occurring if they are paged.
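
As a rough illustration of what this adds, here is a minimal sketch of the dashboard resource, reduced to the first widget from the plan. The resource name, widget title, metrics, and layout values are taken from the plan output in this PR; `dashboard_name` is an assumed placeholder, since the plan excerpt does not show it, and the real definition contains many more widgets.

```hcl
# Sketch only: not the full definition from this PR.
resource "aws_cloudwatch_dashboard" "forms_service_health" {
  dashboard_name = "Forms-Service-Health" # assumed name, not shown in the plan

  # jsonencode turns the HCL object into the JSON body CloudWatch expects.
  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        x      = 0
        y      = 2
        width  = 12
        height = 12
        properties = {
          title   = "App: client submissions"
          view    = "timeSeries"
          stacked = false
          stat    = "Sum"
          period  = 300
          region  = "ca-central-1"
          metrics = [
            ["forms", "ClientSubmitSuccess", { color = "#2ca02c" }],
            # "." repeats the value in the same position of the previous row,
            # here the "forms" namespace.
            [".", "ClientSubmitFailed", { color = "#d62728" }],
          ]
        }
      },
    ]
  })
}
```

Pairing success and failure series in one widget, in green and red, lets an on-caller see at a glance whether failures are rising relative to normal traffic.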

Related

Add a CloudWatch dashboard that displays the system health
metrics, logs and alarms. The goal is to make it easier for on-callers
to identify where issues are occurring if they are paged.
@patheard patheard force-pushed the feat/system-health-dashboard branch from f5c34e6 to 018b498 Compare June 14, 2024 15:38

⚠ Terraform update available

Terraform: 1.8.5 (using 1.6.6)
Terragrunt: 0.59.3 (using 0.54.8)


Staging: rds

✅   Terraform Init: success
✅   Terraform Validate: success
✅   Terraform Format: success
✅   Terraform Plan: success
✅   Conftest: success

Plan: 0 to add, 0 to change, 0 to destroy
Show summary
CHANGE NAME
Show plan
Changes to Outputs:
  + rds_cluster_identifier  = "forms-staging-db-cluster"

You can apply this plan to save these new output values to the Terraform
state, without changing any real infrastructure.

─────────────────────────────────────────────────────────────────────────────

Saved the plan to: plan.tfplan

To perform exactly these actions, run the following command to apply:
    terraform apply "plan.tfplan"
Show Conftest results
WARN - plan.json - main - Missing Common Tags: ["aws_rds_cluster.forms"]
WARN - plan.json - main - Missing Common Tags: ["aws_secretsmanager_secret.database_secret"]
WARN - plan.json - main - Missing Common Tags: ["aws_secretsmanager_secret.database_url"]

22 tests, 19 passed, 3 warnings, 0 failures, 0 exceptions
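
The only change in the rds plan above is the new `rds_cluster_identifier` output. A minimal sketch of how such an output might be declared follows. The output name comes from the plan and `aws_rds_cluster.forms` from the Conftest warnings; the `description` text and the choice of the `cluster_identifier` attribute are assumptions, not copied from the PR diff.

```hcl
# Sketch only: assumes the value is read from the existing cluster resource.
output "rds_cluster_identifier" {
  description = "RDS cluster identifier, usable as a CloudWatch metric dimension"
  value       = aws_rds_cluster.forms.cluster_identifier
}
```

Because an output only writes a value to state, the plan reports 0 to add, 0 to change, 0 to destroy.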


Staging: alarms

✅   Terraform Init: success
✅   Terraform Validate: success
✅   Terraform Format: success
✅   Terraform Plan: success
✅   Conftest: success

Plan: 1 to add, 0 to change, 0 to destroy
Show summary
CHANGE NAME
add aws_cloudwatch_dashboard.forms_service_health

✂   Warning: plan has been truncated! See the full plan in the logs.

Show plan
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_cloudwatch_dashboard.forms_service_health will be created
  + resource "aws_cloudwatch_dashboard" "forms_service_health" {
      + dashboard_arn  = (known after apply)
      + dashboard_body = jsonencode(
            {
              + widgets = [
                  + {
                      + height     = 12
                      + properties = {
                          + metrics = [
                              + [
                                  + "forms",
                                  + "ClientSubmitSuccess",
                                  + {
                                      + color = "#2ca02c"
                                    },
                                ],
                              + [
                                  + ".",
                                  + "ClientSubmitFailed",
                                  + {
                                      + color = "#d62728"
                                    },
                                ],
                            ]
                          + period  = 300
                          + region  = "ca-central-1"
                          + stacked = false
                          + stat    = "Sum"
                          + title   = "App: client submissions"
                          + view    = "timeSeries"
                        }
                      + type       = "metric"
                      + width      = 12
                      + x          = 0
                      + y          = 2
                    },
                  + {
                      + height     = 6
                      + properties = {
                          + metrics = [
                              + [
                                  + "forms",
                                  + "SubmissionSuccess",
                                  + {
                                      + color = "#2ca02c"
                                    },
                                ],
                              + [
                                  + ".",
                                  + "SubmissionWarn",
                                  + {
                                      + color = "#ffbb78"
                                    },
                                ],
                              + [
                                  + ".",
                                  + "SubmissionFailed",
                                  + {
                                      + color = "#d62728"
                                    },
                                ],
                            ]
                          + period  = 300
                          + region  = "ca-central-1"
                          + stacked = false
                          + stat    = "Sum"
                          + title   = "Lambda: submission"
                          + view    = "timeSeries"
                        }
                      + type       = "metric"
                      + width      = 12
                      + x          = 12
                      + y          = 2
                    },
                  + {
                      + height     = 8
                      + properties = {
                          + metrics   = [
                              + [
                                  + "AWS/RDS",
                                  + "CPUUtilization",
                                  + "DBClusterIdentifier",
                                  + "forms-mock-db-cluster",
                                  + {
                                      + color  = "#17becf"
                                      + region = "ca-central-1"
                                    },
                                ],
                            ]
                          + period    = 60
                          + region    = "ca-central-1"
                          + sparkline = true
                          + stacked   = false
                          + stat      = "Average"
                          + title     = "DB: CPU use"
                          + view      = "timeSeries"
                        }
                      + type       = "metric"
                      + width      = 6
                      + x          = 0
                      + y          = 74
                    },
                  + {
                      + height     = 6
                      + properties = {
                          + metrics = [
                              + [
                                  + "forms",
                                  + "ReliabilitySuccess",
                                  + {
                                      + color = "#2ca02c"
                                    },
                                ],
                              + [
                                  + ".",
                                  + "ReliabilityWarn",
                                  + {
                                      + color = "#ffbb78"
                                    },
                                ],
                              + [
                                  + ".",
                                  + "ReliabilityFailed",
                                  + {
                                      + color = "#d62728"
                                    },
                                ],
                            ]
                          + period  = 300
                          + region  = "ca-central-1"
                          + stacked = false
                          + stat    = "Sum"
                          + title   = "Lambda: reliability"
                          + view    = "timeSeries"
                        }
                      + type       = "metric"
                      + width      = 12
                      + x          = 12
                      + y          = 8
                    },
                  + {
                      + height     = 2
                      + properties = {
                          + background = "transparent"
                          + markdown   = ""
                        }
                      + type       = "text"
                      + width      = 24
                      + x          = 0
                      + y          = 20
                    },
                  + {
                      + height     = 8
                      + properties = {
                          + metrics   = [
                              + [
                                  + "AWS/RDS",
                                  + "FreeableMemory",
                                  + "DBClusterIdentifier",
                                  + "forms-mock-db-cluster",
                                  + {
                                      + color = "#9467bd"
                                    },
                                ],
                            ]
                          + period    = 60
                          + region    = "ca-central-1"
                          + sparkline = true
                          + stacked   = false
                          + stat      = "Average"
                          + title     = "DB: freeable memory"
                          + view      = "timeSeries"
                        }
                      + type       = "metric"
                      + width      = 6
                      + x          = 6
                      + y          = 74
                    },
                  + {
                      + height     = 8
                      + properties = {
                          + metrics   = [
                              + [
                                  + "AWS/RDS",
                                  + "ReadLatency",
                                  + "DBClusterIdentifier",
                                  + "forms-mock-db-cluster",
                                  + {
                                      + color = "#c5b0d5"
                                    },
                                ],
                            ]
                          + period    = 60
                          + region    = "ca-central-1"
                          + sparkline = true
                          + stacked   = false
                          + stat      = "Average"
                          + title     = "DB: read latency"
                          + view      = "timeSeries"
                        }
                      + type       = "metric"
                      + width      = 6
                      + x          = 12
                      + y          = 74
                    },
                  + {
                      + height     = 8
                      + properties = {
                          + metrics   = [
                              + [
                                  + "AWS/RDS",
                                  + "WriteLatency",
                                  + "DBClusterIdentifier",
                                  + "forms-mock-db-cluster",
                                  + {
                                      + color = "#7f7f7f"
                                    },
                                ],
                            ]
                          + period    = 60
                          + region    = "ca-central-1"
                          + sparkline = true
                          + stacked   = false
                          + stat      = "Average"
                          + title     = "DB: write latency"
                          + view      = "timeSeries"
                        }
                      + type       = "metric"
                      + width      = 6
                      + x          = 18
                      + y          = 74
                    },
                  + {
                      + height     = 2
                      + properties = {
                          + background = "transparent"
                          + markdown   = <<-EOT
                                # Form submissions
                                Tracking form submissions flow through the system.
                            EOT
                        }
                      + type       = "text"
                      + width      = 24
                      + x          = 0
                      + y          = 0
                    },
                  + {
                      + height     = 6
                      + properties = {
                          + metrics = [
                              + [
                                  + "AWS/SQS",
                                  + "NumberOfMessagesReceived",
                                  + "QueueName",
                                  + "submission_processing.fifo",
                                  + {
                                      + color = "#8c564b"
                                    },
                                ],
                            ]
                          + period  = 300
                          + region  = "ca-central-1"
                          + stacked = false
                          + stat    = "Sum"
                          + title   = "Queue: submission messages"
                          + view    = "timeSeries"
                        }
                      + type       = "metric"
                      + width      = 8
                      + x          = 0
                      + y          = 14
                    },
                  + {
                      + height     = 6
                      + properties = {
                          + metrics   = [
                              + [
                                  + "AWS/SQS",
                                  + "ApproximateAgeOfOldestMessage",
                                  + "QueueName",
                                  + "submission_processing.fifo",
                                  + {
                                      + color  = "#7f7f7f"
                                      + label  = "Oldest message age"
                                      + region = "ca-central-1"
                                    },
                                ],
                            ]
                          + period    = 300
                          + region    = "ca-central-1"
                          + sparkline = true
                          + stat      = "Average"
                          + title     = "Queue: submission message age"
                          + view      = "singleValue"
                        }
                      + type       = "metric"
                      + width      = 4
                      + x          = 8
                      + y          = 14
                    },
                  + {
                      + height     = 6
                      + properties = {
                          + metrics = [
                              + [
                                  + "forms",
                                  + "ReliabilityNotifySendSuccess",
                                  + {
                                      + color  = "#2ca02c"
                                      + region = "ca-central-1"
                                    },
                                ],
                              + [
                                  + ".",
                                  + "ReliabilityNotifySendFailed",
                                  + {
                                      + color  = "#d62728"
                                      + region = "ca-central-1"
                                    },
                                ],
                              + [
                                  + ".",
                                  + "ReliabilityVaultSaveSuccess",
                                  + {
                                      + color  = "#1f77b4"
                                      + region = "ca-central-1"
                                    },
                                ],
                              + [
                                  + ".",
                                  + "ReliabilityVaultSaveFailed",
                                  + {
                                      + color  = "#ff7f0e"
                                      + region = "ca-central-1"
                                    },
                                ],
                            ]
                          + period  = 300
                          + region  = "ca-central-1"
                          + stacked = false
                          + stat    = "Sum"
                          + title   = "Lambda: reliability send/save"
                          + view    = "timeSeries"
                        }
                      + type       = "metric"
                      + width      = 12
                      + x          = 12
                      + y          = 14
                    },
                  + {
                      + height     = 2
                      + properties = {
                          + background = "transparent"
                          + markdown   = ""
                        }
                      + type       = "text"
                      + width      = 24
                      + x          = 0
                      + y          = 43
                    },
                  + {
                      + height     = 2
                      + properties = {
                          + background = "transparent"
                          + markdown   = <<-EOT
                                # Errors
                                Error logs and alarms from the app and lambdas.
                            EOT
                        }
                      + type       = "text"
                      + width      = 24
                      + x          = 0
                      + y          = 22
                    },
                  + {
                      + height     = 2
                      + properties = {
                          + background = "transparent"
                          + markdown   = <<-EOT
                                # Lambdas
                                Performance metrics for the Lambda functions.
                            EOT
                        }
                      + type       = "text"
                      + width      = 24
                      + x          = 0
                      + y          = 45
                    },
                  + {
                      + height     = 7
                      + properties = {
                          + metrics = [
                              + [
                                  + "AWS/ECS",
                                  + "CPUUtilization",
                                  + "ServiceName",
                                  + "form-viewer",
                                  + "ClusterName",
                                  + "Forms",
                                  + {
                                      + region = "ca-central-1"
                                      + stat   = "Minimum"
                                    },
                                ],
                              + [
                                  + "...",
                                  + {
                                      + region = "ca-central-1"
                                      + stat   = "Maximum"
                                    },
                                ],
                              + [
                                  + "...",
                                  + {
                                      + region = "ca-central-1"
                                      + stat   = "Average"
                                    },
                                ],
                            ]
                          + period  = 300
                          + region  = "ca-central-1"
                          + stacked = false
                          + title   = "App: CPU use"
                          + view    = "timeSeries"
                        }
                      + type       = "metric"
                      + width      = 8
                      + x          = 0
                      + y          = 36
                    },
                  + {
                      + height     = 7
                      + properties = {
                          + metrics = [
                              + [
                                  + "AWS/ECS",
                                  + "MemoryUtilization",
                                  + "ServiceName",
                                  + "form-viewer",
                                  + "ClusterName",
                                  + "Forms",
                                  + {
                                      + stat = "Minimum"
                                    },
                                ],
                              + [
                                  + "...",
                                  + {
                                      + stat = "Maximum"
                                    },
                                ],
                              + [
                                  + "...",
                                  + {
                                      + stat = "Average"
                                    },
                                ],
                            ]
                          + period  = 300
                          + region  = "ca-central-1"
                          + stacked = false
                          + title   = "App: memory use"
                          + view    = "timeSeries"
                        }
                      + type       = "metric"
                      + width      = 8
                      + x          = 8
                      + y          = 36
                    },
                  + {
                      + height     = 2
                      + properties = {
                          + background = "transparent"
                          + markdown   = ""
                        }
                      + type       = "text"
                      + width      = 24
                      + x          = 0
                      + y          = 59
                    },
                  + {
                      + height     = 2
                      + properties = {
                          + background = "transparent"
                          + markdown   = <<-EOT
                                # Load balancer
                                Requests, errors and response time for the app's load balancer.
                            EOT
                        }
                      + type       = "text"
                      + width      = 24
                      + x          = 0
                      + y          = 61
                    },
                  + {
                      + height     = 6
                      + properties = {
                          + metrics = [
                              + [
                                  + "AWS/Lambda",
                                  + "Invocations",
                                  + "FunctionName",
                                  + "Submission",
                                  + {
                                      + region = "ca-central-1"
                                    },
                                ],
                              + [
                                  + ".",
                                  + "Throttles",
                                  + ".",
                                  + ".",
                                  + {
                                      + color  = "#ffbb78"
                                      + region = "ca-central-1"
                                    },
                                ],
                              + [
                                  + ".",
                                  + "Errors",
                                  + ".",
                                  + ".",
                                  + {
                                      + region = "ca-central-1"
                                    },
                                ],
                            ]
                          + period  = 300
                          + region  = "ca-central-1"
                          + stacked = false
                          + stat    = "Sum"
                          + title   = "Lambda: submission"
                          + view    = "timeSeries"
                        }
                      + type       = "metric"
                      + width      = 18
                      + x          = 0
                      + y          = 47
                    },
                  + {
                      + height     = 6
                      + properties = {
                          + metrics   = [
                              + [
                                  + "AWS/Lambda",
                                  + "Duration",
                                  + "FunctionName",
                                  + "Submission",
                                  + "Resource",
                                  + "Submission",
                                  + {
                                      + color  = "#555555"
                                      + region = "ca-central-1"
                                    },
                                ],
                            ]
                          + period    = 300
                          + region    = "ca-central-1"
                          + sparkline = true
                          + stacked   = false
                          + stat      = "Average"
                          + title     = "Lambda: submission duration"
                          + view      = "singleValue"
                        }
                      + type       = "metric"
                      + width      = 6
                      + x          = 18
                      + y          = 47
                    },
                  + {
                      + height     = 6
                      + properties = {
                          + metrics = [
                              + [
                                  + "AWS/Lambda",
                                  + "Invocations",
                                  + "FunctionName",
                                  + "reliability",
                                  + "Resource",
                                  + "reliability",
                                  + {
                                      + region = "ca-central-1"
                                    },
                                ],
                              + [
                                  + ".",
                                  + "Throttles",
                                  + ".",
                                  + ".",
                                  + ".",
                                  + ".",
                                  + {
                                      + color  = "#ffbb78"
                                      + region = "ca-central-1"
                                    },
                                ],
                              + [
                                  + ".",
                                  + "Errors",
                                  + ".",
                                  + ".",
                                  + ".",
                                  + ".",
                                  + {
                                      + region = "ca-central-1"
                                    },
                                ],
                            ]
                          + period  = 300
                          + region  = "ca-central-1"
                          + stacked = false
                          + stat    = "Sum"
                          + title   = "Lambda: reliability"
                          + view    = "timeSeries"
                        }
                      + type       = "metric"
                      + width      = 18
                      + x          = 0
                      + y          = 53
                    },
                  + {
                      + height     = 6
                      + properties = {
                          + metrics   = [
                              + [
                                  + "AWS/Lambda",
                                  + "Duration",
                                  + "FunctionName",
                                  + "reliability",
                                  + "Resource",
                                  + "reliability",
                                  + {
                                      + color = "#555"
                                    },
                                ],
                            ]
                          + period    = 300
                          + region    = "ca-central-1"
                          + sparkline = true
                          + stacked   = false
                          + stat      = "Average"
                          + title     = "Lambda: reliability duration"
                          + view      = "singleValue"
                        }
                      + type       = "metric"
                      + width      = 6
                      + x          = 18
                      + y          = 53
                    },
                  + {
                      + height     = 7
                      + properties = {
                          + metrics = [
                              + [
                                  + "ECS/ContainerInsights",
                                  + "NetworkRxBytes",
                                  + "ClusterName",
                                  + "Forms",
                                  + {
                                      + color  = "#1f77b4"
                                      + region = "ca-central-1"
                                    },
                                ],
                            ]
                          + period  = 300
                          + region  = "ca-central-1"
                          + stacked = false
                          + stat    = "Sum"
                          + title   = "App: network bytes"
                          + view    = "timeSeries"
                        }
                      + type       = "metric"
                      + width      = 8
                      + x          = 16
                      + y          = 36
                    },
                  + {
                      + height     = 7
                      + properties = {
                          + metrics = [
                              + [
                                  + "AWS/ApplicationELB",
                                  + "RequestCount",
                                  + "LoadBalancer",
                                  + "app/form-viewer/5e6bc2d9ab810b68",
                                  + {
                                      + color  = "#2ca02c"
                                      + label  = "Request count"
                                      + region = "ca-central-1"
                                    },
                                ],
                              + [
                                  + ".",
                                  + "HTTPCode_ELB_4XX_Count",
                                  + ".",
                                  + ".",
                                  + {
                                      + color  = "#ffbb78"
                                      + label  =...
Show Conftest results
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_event_rule.codedeploy_sns"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_log_group.notify_slack"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ELB_5xx_error_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.UnHealthyHostCount-TargetGroup1"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.UnHealthyHostCount-TargetGroup2"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.alb_ddos"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.audit_log_dead_letter_queue_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.cognito_login_outside_canada_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.cognito_signin_exceeded"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ddos_detected_forms_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.ddos_detected_route53_warn[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.forms_cpu_utilization_high_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.forms_memory_utilization_high_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.reliability_dead_letter_queue_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.response_time_warn"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.route53_ddos[0]"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.twoFa_verification_exceeded"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.vault_data_integrity_check_lambda_iterator_age"]
WARN - plan.json - main - Missing Common Tags: ["aws_iam_role.notify_slack_lambda"]
WARN - plan.json -...
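
The DB widgets in the alarms plan above use `forms-mock-db-cluster`, while the rds plan outputs `forms-staging-db-cluster`. One plausible reading is that the alarms module receives the identifier through a Terragrunt dependency, and a mock value is substituted during `plan` because the new rds output has not been applied yet. The sketch below shows how that wiring could look; the `config_path`, the list of allowed commands, and the input name are assumptions, not taken from this PR.

```hcl
# Sketch only: hypothetical terragrunt.hcl fragment for the alarms module.
dependency "rds" {
  config_path = "../rds" # assumed relative path to the rds module

  # Mock values are used only for these commands, and only when the real
  # output is not yet present in the dependency's state.
  mock_outputs_allowed_terraform_commands = ["init", "validate", "plan"]
  mock_outputs = {
    rds_cluster_identifier = "forms-mock-db-cluster"
  }
}

inputs = {
  # Hypothetical input consumed by the dashboard's DBClusterIdentifier dimension.
  rds_cluster_identifier = dependency.rds.outputs.rds_cluster_identifier
}
```

If this is how the modules are wired, the mock value would be replaced by the real identifier once the rds change is applied.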

@patheard patheard marked this pull request as ready for review June 14, 2024 15:50
Contributor

@craigzour craigzour left a comment

LGTM!

@patheard patheard merged commit 74b810f into develop Jun 14, 2024
10 checks passed
@patheard patheard deleted the feat/system-health-dashboard branch June 14, 2024 17:02
Labels
None yet
Projects
None yet
2 participants