Merge pull request #12 from qonto/add-prometheus-unittests
Add prometheus unittests
vmercierfr authored Nov 23, 2023
2 parents a9d17d8 + 7a46a9e commit 351310e
Showing 34 changed files with 806 additions and 13 deletions.
17 changes: 17 additions & 0 deletions .github/workflows/test.yaml
@@ -68,3 +68,20 @@ jobs:
      - name: Run Kubeconform test
        run: make kubeconform-test

  prometheus-rules:
    name: prometheus-rules
    runs-on: ubuntu-latest
    env:
      PROMETHEUS_VERSION: 2.48.0
    steps:
      - uses: actions/checkout@v3

      - name: Install Promtool (Prometheus)
        run: |
          curl -sSLo /tmp/prometheus.tar.gz "https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz" \
          && tar -C /tmp -xzvf /tmp/prometheus.tar.gz \
          && cp /tmp/prometheus-${PROMETHEUS_VERSION}.linux-amd64/promtool /usr/local/bin/promtool
      - name: Prometheus test
        run: make prometheus-test
11 changes: 11 additions & 0 deletions CONTRIBUTING.md
@@ -18,6 +18,7 @@ We use :

* [`pre-commit`](https://pre-commit.com) to have style consistency in runbooks.
* [`markdownlint-cli2`](https://github.com/DavidAnson/markdownlint-cli2) for linting the Markdown document. If you think the linter is incorrect, look at [configuration](https://github.com/DavidAnson/markdownlint/blob/main/README.md#configuration) to ignore the line.
* [`promtool`](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#syntax-checking-rules) to check Prometheus rules
* GitHub workflows to run tests

## Pull Request Checklist
@@ -77,12 +78,20 @@ To have Hugo templates resolution, we recommend to edit pages using Hugo webserv

## Run tests

Requirements

* `promtool` to run Prometheus tests (`brew install prometheus`)
* `kubeconform` (`brew install kubeconform`)

Steps

`make all-tests` runs all tests, but you can also run them individually:

```bash
make helm-test # Helm unit tests
make kubeconform-test # Check Helm charts render valid Kubernetes manifests
make runbook-test # Test runbooks
make prometheus-test # Test Prometheus alerts
```

SQL queries (files with a `.sql` extension) are tested against a PostgreSQL instance; these tests must be launched manually.
@@ -146,6 +155,8 @@ Any runboks:
message: "<comprehensive description of the alert>"
```

1. Add a [Prometheus unit test](https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/) in `charts/prometheus-<component>-alerts/prometheus_tests/<alertName>.yml`

1. Run tests

```bash
7 changes: 6 additions & 1 deletion Makefile
@@ -21,8 +21,13 @@ runbook-test:
 sql-test:
 	PGHOST=localhost PGUSER=postgres PGPASSWORD=hackme PGDATABASE=test ./scripts/validate_sql_files.sh content $(PGDATABASE)
 
+.PHONY: prometheus-test
+prometheus-test:
+	./scripts/check_prometheus_rules.sh charts/prometheus-rds-alerts
+	./scripts/check_prometheus_rules.sh charts/prometheus-postgresql-alerts
+
 .PHONY: all-tests
-all-tests: helm-test kubeconform-test runbook-test sql-test
+all-tests: helm-test kubeconform-test runbook-test sql-test prometheus-test
 
 .PHONY: helm-release
 helm-release:
1 change: 1 addition & 0 deletions charts/prometheus-postgresql-alerts/.helmignore
@@ -1 +1,2 @@
tests
prometheus_tests
@@ -0,0 +1,25 @@
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:

  - name: PostgreSQLExporterDown
    interval: 1m
    input_series:
      - series: 'up{instance="localhost:9187"}'
        values: '0x10'
      - series: 'postgres_exporter_build_info{instance="localhost:9187"}'
        values: '1x10'
    alert_rule_test:
      - alertname: PostgreSQLExporterDown
        eval_time: 5m
        exp_alerts:
          - exp_labels:
              instance: localhost:9187
              severity: critical
            exp_annotations:
              summary: "Exporter is down"
              description: "localhost:9187 exporter is down"
              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/postgresql/PostgreSQLExporterDown"
@@ -0,0 +1,23 @@
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:

  - name: PostgreSQLExporterErrors
    interval: 1m
    input_series:
      - series: 'pg_exporter_last_scrape_error{job="postgresql-exporter"}'
        values: 1+1x10
    alert_rule_test:
      - alertname: PostgreSQLExporterErrors
        eval_time: 5m
        exp_alerts:
          - exp_labels:
              job: postgresql-exporter
              severity: critical
            exp_annotations:
              summary: "Exporter is reporting scraping errors"
              description: "postgresql-exporter is reporting scraping errors. Some metrics are not collected anymore"
              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/postgresql/PostgreSQLExporterErrors"
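
The `values` strings in these test files use promtool's expanding notation, where `a+bxn` produces `a, a+b, ..., a+n*b` and `axn` repeats `a`. With `interval: 1m`, `x10` therefore yields 11 samples covering 0m to 10m. The sketch below only illustrates that expansion; `expand_values` is a hypothetical helper, not part of promtool, and it ignores other notation features such as `_` and `stale`.

```python
import re

# Matches one term of the expanding notation: 'a+bxn', 'a-bxn', or 'axn'.
_EXPANDING = re.compile(
    r"^(?P<start>-?\d+(?:\.\d+)?)"
    r"(?:(?P<sign>[+-])(?P<step>\d+(?:\.\d+)?))?"
    r"x(?P<count>\d+)$"
)


def expand_values(notation: str) -> list[float]:
    """Expand one term into its n + 1 samples: a, a+b, ..., a+n*b."""
    match = _EXPANDING.match(notation.strip())
    if match is None:
        raise ValueError(f"unsupported notation: {notation!r}")
    start = float(match["start"])
    step = float(match["step"] or 0)  # 'axn' is shorthand for 'a+0xn'
    if match["sign"] == "-":
        step = -step
    count = int(match["count"])
    return [start + i * step for i in range(count + 1)]


if __name__ == "__main__":
    print(expand_values("0x10"))    # eleven zeros
    print(expand_values("1+1x10"))  # 1.0 through 11.0
```

For example, `40+1x10` starts at 40 and ends at 50, so a `> 30` threshold is exceeded from the first sample onward.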
@@ -0,0 +1,20 @@
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:

  - name: PostgreSQLExporterMissingScrapeErrorMetric
    interval: 1m
    input_series: []
    alert_rule_test:
      - alertname: PostgreSQLExporterMissingScrapeErrorMetric
        eval_time: 5m
        exp_alerts:
          - exp_labels:
              severity: critical
            exp_annotations:
              summary: "PostgreSQL exporter last scrape error metric is missing"
              description: "PostgreSQL exporter last scrape error metric is missing. Either the exporter is down or some metrics are not collected anymore"
              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/postgresql/PostgreSQLExporterMissingScrapeErrorMetric"
@@ -0,0 +1,23 @@
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:

  - name: PostgreSQLExporterScrapingLimit
    interval: 1m
    input_series:
      - series: 'pg_exporter_last_scrape_duration_seconds{instance="localhost:9187"}'
        values: 40x10
    alert_rule_test:
      - alertname: PostgreSQLExporterScrapingLimit
        eval_time: 5m
        exp_alerts:
          - exp_labels:
              instance: localhost:9187
              severity: warning
            exp_annotations:
              summary: "Exporter scraping take long time"
              description: "localhost:9187 scraping take long time"
              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/postgresql/PostgreSQLExporterScrapingLimit"
@@ -0,0 +1,24 @@
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:

  - name: PostgreSQLInactiveLogicalReplicationSlot
    interval: 1m
    input_series:
      - series: 'pg_replication_slots_active{slot_type="logical", server="db1.unittest.eu-west-3.rds.amazonaws.com:5432", slot_name="unittest"}'
        values: 0x10
    alert_rule_test:
      - alertname: PostgreSQLInactiveLogicalReplicationSlot
        eval_time: 10m
        exp_alerts:
          - exp_labels:
              server: db1.unittest.eu-west-3.rds.amazonaws.com:5432
              slot_name: unittest
              severity: warning
            exp_annotations:
              summary: "Logical replication slot is inactive"
              description: "unittest on db1 is inactive"
              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/postgresql/PostgreSQLInactiveLogicalReplicationSlot"
@@ -0,0 +1,24 @@
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:

  - name: PostgreSQLInactivePhysicalReplicationSlot
    interval: 1m
    input_series:
      - series: 'pg_replication_slots_active{slot_type="physical", server="db1.unittest.eu-west-3.rds.amazonaws.com:5432", slot_name="unittest"}'
        values: 0x10
    alert_rule_test:
      - alertname: PostgreSQLInactivePhysicalReplicationSlot
        eval_time: 10m
        exp_alerts:
          - exp_labels:
              server: db1.unittest.eu-west-3.rds.amazonaws.com:5432
              slot_name: unittest
              severity: warning
            exp_annotations:
              summary: "Physical replication slot is inactive"
              description: "unittest on db1 is inactive"
              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/postgresql/PostgreSQLInactivePhysicalReplicationSlot"
@@ -0,0 +1,26 @@
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:

  - name: PostgreSQLInvalidIndex
    interval: 1m
    input_series:
      - series: 'pg_stat_user_indexes_idx_blks_hit{cluster="db1", datname="unittest", relname="test", indexrelname="idx_id", indisvalid="false"}'
        values: 1x60
    alert_rule_test:
      - alertname: PostgreSQLInvalidIndex
        eval_time: 1h
        exp_alerts:
          - exp_labels:
              cluster: db1
              datname: unittest
              relname: test
              indexrelname: idx_id
              severity: warning
            exp_annotations:
              summary: "idx_id is invalid"
              description: "idx_id of test table on unittest database on db1 is invalid"
              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/postgresql/PostgreSQLInvalidIndex"
@@ -0,0 +1,26 @@
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:

  - name: PostgreSQLLongRunningQuery
    interval: 1m
    input_series:
      - series: 'pg_active_backend_duration_minutes{server="db1.unittest.eu-west-3.rds.amazonaws.com:5432",datname="unittest",usename="test",pid="1234"}'
        values: 40+1x10
    alert_rule_test:
      - alertname: PostgreSQLLongRunningQuery
        eval_time: 1m
        exp_alerts:
          - exp_labels:
              server: db1.unittest.eu-west-3.rds.amazonaws.com:5432
              datname: unittest
              usename: test
              severity: warning
              pid: 1234
            exp_annotations:
              summary: "Long running query on unittest of db1"
              description: "test is running a long query on unittest of db1 with pid 1234"
              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/postgresql/PostgreSQLLongRunningQuery"
@@ -0,0 +1,25 @@
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:

  - name: PostgreSQLMaxConnections
    interval: 1m
    input_series:
      - series: 'pg_stat_connections_count{server="db1.unittest.eu-west-3.rds.amazonaws.com:5432"}'
        values: 905x10 # 90.5 that are rendered as 90%
      - series: 'pg_settings_max_connections{server="db1.unittest.eu-west-3.rds.amazonaws.com:5432"}'
        values: 1000x10
    alert_rule_test:
      - alertname: PostgreSQLMaxConnections
        eval_time: 10m
        exp_alerts:
          - exp_labels:
              server: db1.unittest.eu-west-3.rds.amazonaws.com:5432
              severity: warning
            exp_annotations:
              summary: "db1 is close from the maximum database connections"
              description: "db1 uses 90% of the maximum database connections"
              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/postgresql/PostgreSQLMaxConnections"
@@ -0,0 +1,24 @@
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:

  - name: PostgreSQLReplicationSlotStorageLimit
    interval: 1m
    input_series:
      - series: 'pg_replication_slots_available_storage_percent{server="db1.unittest.eu-west-3.rds.amazonaws.com:5432",slot_name="unittest"}'
        values: 15.8x10
    alert_rule_test:
      - alertname: PostgreSQLReplicationSlotStorageLimit
        eval_time: 5m
        exp_alerts:
          - exp_labels:
              server: db1.unittest.eu-west-3.rds.amazonaws.com:5432
              slot_name: unittest
              severity: warning
            exp_annotations:
              summary: "unittest on db1 is close to its storage limit"
              description: "unittest has 16% free disk storage space"
              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/postgresql/PostgreSQLReplicationSlotStorageLimit"
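
The expected descriptions in the two tests above ("90%" for 90.5 and "16%" for 15.8) follow from the `printf "%.2g"` call in the alert annotations, which keeps two significant digits. As a rough illustration, Python's `%` operator uses the same C-style `%g` convention and is assumed here as a stand-in for the Go template function; `render_percent` is a hypothetical name.

```python
def render_percent(value: float) -> str:
    """Format a value with two significant digits, as '%.2g' does."""
    return "%.2g" % value


if __name__ == "__main__":
    print(render_percent(90.5))   # 90 (the tie rounds to the even digit)
    print(render_percent(15.8))   # 16
    print(render_percent(100.0))  # 1e+02 (switches to exponent form)
```

The last line shows Python's behaviour only. Whether the Go template renders 100 the same way is worth checking before relying on this format for values at or above 100.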
12 changes: 6 additions & 6 deletions charts/prometheus-postgresql-alerts/values.yaml
@@ -54,8 +54,8 @@ rules:
     labels:
       severity: warning
     annotations:
-      summary: "{{ $labels.server }} is close from the maximum database connections"
-      description: '{{ $labels.server }} uses {{ printf "%.2g" $value }}% of database connections'
+      summary: "{{ $labels.server | stripDomain | stripPort }} is close from the maximum database connections"
+      description: '{{ $labels.server | stripDomain | stripPort }} uses {{ printf "%.2g" $value }}% of the maximum database connections'
 
   PostgreSQLReplicationSlotStorageLimit:
     expr: max by (server, slot_name) (pg_replication_slots_available_storage_percent{}) < 20
@@ -64,7 +64,7 @@ rules:
       severity: warning
     annotations:
       summary: "{{ $labels.slot_name }} on {{ $labels.server | stripDomain | stripPort }} is close to its storage limit"
-      description: '{{ $labels.slot_name }} uses {{ printf "%.2g" $value }}% of its storage limit'
+      description: '{{ $labels.slot_name }} has {{ printf "%.2g" $value }}% free disk storage space'
 
   PostgreSQLInactiveLogicalReplicationSlot:
     expr: max by (server, slot_name) (pg_replication_slots_active{slot_type="logical"}) < 1
@@ -85,12 +85,12 @@ rules:
       description: "{{ $labels.slot_name }} on {{ $labels.server | stripDomain | stripPort }} is inactive"
 
   PostgreSQLLongRunningQuery:
-    expr: max by (server, datname, usename) (pg_active_backend_duration_minutes{usename!=""}) > 30
+    expr: max by (server, datname, usename, pid) (pg_active_backend_duration_minutes{usename!=""}) > 30
     for: 1m
     labels:
       severity: warning
     annotations:
-      summary: "Long running query on {{ $labels.datname }} of {{ $labels.server | stripPort }}."
+      summary: "Long running query on {{ $labels.datname }} of {{ $labels.server | stripDomain | stripPort }}"
       description: "{{ $labels.usename }} is running a long query on {{ $labels.datname }} of {{ $labels.server | stripDomain | stripPort }} with pid {{ $labels.pid }}"
     pintComments:
       - disable promql/series
@@ -102,6 +102,6 @@ rules:
       severity: warning
     annotations:
       summary: "{{ $labels.indexrelname }} is invalid"
-      description: "{{ $labels.indexrelname }} of {{ $labels.relname }} table on {{ $labels.datname }} database on {{ $labels.server }} is invalid"
+      description: "{{ $labels.indexrelname }} of {{ $labels.relname }} table on {{ $labels.datname }} database on {{ $labels.cluster }} is invalid"
     pintComments:
       - disable promql/series
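
The annotation changes above pipe `$labels.server` through Prometheus' `stripDomain` and `stripPort` template functions, which is why the unit tests expect `db1` instead of the full RDS endpoint. The sketch below is a simplified Python approximation of those two functions, not the Prometheus implementation: `strip_domain` and `strip_port` are hypothetical names, and IPv6 literals are not handled.

```python
import ipaddress


def strip_port(host_port: str) -> str:
    """Drop a trailing ':port' if one is present."""
    host, sep, port = host_port.rpartition(":")
    return host if sep and port.isdigit() else host_port


def strip_domain(host_port: str) -> str:
    """Keep only the first DNS label, preserving any ':port' suffix."""
    host, sep, port = host_port.rpartition(":")
    if not (sep and port.isdigit()):
        host, port = host_port, ""
    try:
        ipaddress.ip_address(host)
        return host_port  # IP addresses are returned unchanged
    except ValueError:
        pass
    short = host.split(".")[0]
    return f"{short}:{port}" if port else short


if __name__ == "__main__":
    server = "db1.unittest.eu-west-3.rds.amazonaws.com:5432"
    print(strip_domain(server))              # db1:5432
    print(strip_port(strip_domain(server)))  # db1
```

Applying only `strip_port`, as the earlier `PostgreSQLLongRunningQuery` summary did, would leave the full domain in place, which is what the change from `stripPort` to `stripDomain | stripPort` addresses.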
1 change: 1 addition & 0 deletions charts/prometheus-rds-alerts/.helmignore
@@ -1 +1,2 @@
tests
prometheus_tests
@@ -0,0 +1,25 @@
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:

  - name: RDSCPUUtilization
    interval: 1m
    input_series:
      - series: 'rds_cpu_usage_percent_average{aws_account_id="111111111111",aws_region="eu-west-3",dbidentifier="db1"}'
        values: '90x10'
    alert_rule_test:
      - alertname: RDSCPUUtilization
        eval_time: 10m
        exp_alerts:
          - exp_labels:
              aws_account_id: 111111111111
              aws_region: eu-west-3
              dbidentifier: db1
              severity: warning
            exp_annotations:
              description: "db1 has 90% CPU used"
              summary: "db1 has high CPU utilization"
              runbook_url: "https://qonto.github.io/database-monitoring-framework/0.0.0/runbooks/rds/RDSCPUUtilization"
