mixin: Use sidecar's metric timestamp for healthcheck
During Prometheus updates the alert was firing because the metric is
initialized with a value of '0' before the first heartbeat is sent. As
a result, the alert expression effectively evaluated to just the value
of time(), which gave misleading information about the health of the
sidecar.

As the thanos_sidecar_last_heartbeat_success_time_seconds metric is
effectively just a timestamp that resets on new deployments, we can
simply wrap it in the timestamp() function, which should return almost
the same value as the metric itself, with the added benefit that
heartbeat resets are ignored.

This also refactors the relevant tests and drops the timeout to 4
minutes to ensure that we are not hit by stale data if the sidecar
takes longer to start.

Signed-off-by: Markos Chandras <[email protected]>
hwoarang committed Feb 12, 2021
1 parent 7b09e30 commit 7dba321
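
To make the reasoning in the commit message concrete, here is a minimal Python sketch (not part of the commit) that models the old and the new alert expression. It assumes Prometheus' default 5-minute lookback window and treats it as a simple half-open interval; the helper names (`latest_sample`, `old_expr`, `new_expr`) and the scrape data are illustrative assumptions, with timestamps starting at 0 as in the unit tests rather than real Unix times.

```python
LOOKBACK = 5 * 60  # Prometheus' default lookback delta, in seconds


def latest_sample(samples, t):
    """Newest (timestamp, value) pair inside the lookback window, or None if stale."""
    visible = [(ts, v) for ts, v in samples if t - LOOKBACK < ts <= t]
    return max(visible) if visible else None


def old_expr(samples, t):
    """Models time() - metric: subtracts the sample *value*."""
    sample = latest_sample(samples, t)
    return None if sample is None else t - sample[1]


def new_expr(samples, t):
    """Models time() - timestamp(metric): subtracts the sample's scrape time."""
    sample = latest_sample(samples, t)
    return None if sample is None else t - sample[0]


# Hypothetical scrapes, one per minute, then nothing (sidecar down).
dead = [(0, 5), (60, 10), (120, 43), (180, 17), (240, 11)]
# Same history, but the sidecar is back at t=600 and exposes 0 until its first heartbeat.
restarted = dead + [(600, 0)]

print(old_expr(restarted, 600))  # 600: looks unhealthy although the sidecar just started
print(new_expr(restarted, 600))  # 0: the reset value is ignored
print(new_expr(dead, 480))       # 240: the last scrape was 4 minutes ago
print(new_expr(dead, 600))       # None: the series has gone stale
```

In this model the last call returns nothing at all, which suggests why the threshold has to stay below the lookback window: once the newest sample is older than that, there is no series left for the alert to fire on.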
Showing 5 changed files with 56 additions and 101 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -25,6 +25,9 @@ We use _breaking :warning:_ to mark changes that are not backward compatible (re
 
 - [#3705](https://github.com/thanos-io/thanos/pull/3705) Store: Fix race condition leading to failing queries or possibly incorrect query results.
 
+### Fixed
+- [#3204](https://github.com/thanos-io/thanos/pull/3204) Mixin: Use sidecar's metric timestamp for healthcheck.
+
 ## [v0.18.0](https://github.com/thanos-io/thanos/releases) - Release in progress
 
 ### Added
4 changes: 2 additions & 2 deletions examples/alerts/alerts.md
@@ -327,12 +327,12 @@ rules:
     severity: critical
 - alert: ThanosSidecarUnhealthy
   annotations:
-    description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{
+    description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for more than {{
       $value }} seconds.
     runbook_url: https://github.com/thanos-io/thanos/tree/master/mixin/runbook.md#alert-name-thanossidecarunhealthy
     summary: Thanos Sidecar is unhealthy.
   expr: |
-    time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job, pod) >= 600
+    time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"})) by (job, pod) >= 240
   labels:
     severity: critical
 ```
4 changes: 2 additions & 2 deletions examples/alerts/alerts.yaml
@@ -309,11 +309,11 @@ groups:
   - alert: ThanosSidecarUnhealthy
     annotations:
       description: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for
-        {{ $value }} seconds.
+        more than {{ $value }} seconds.
       runbook_url: https://github.com/thanos-io/thanos/tree/master/mixin/runbook.md#alert-name-thanossidecarunhealthy
       summary: Thanos Sidecar is unhealthy.
     expr: |
-      time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job, pod) >= 600
+      time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"})) by (job,pod) >= 240
     labels:
       severity: critical
 - name: thanos-store
142 changes: 47 additions & 95 deletions examples/alerts/tests.yaml
@@ -8,9 +8,9 @@ tests:
 - interval: 1m
   input_series:
   - series: 'thanos_sidecar_last_heartbeat_success_time_seconds{namespace="production", job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
-    values: '5 10 43 17 11 0 0 0'
+    values: '5 10 43 17 11 _x5 0x10'
   - series: 'thanos_sidecar_last_heartbeat_success_time_seconds{namespace="production", job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
-    values: '4 9 42 15 10 0 0 0'
+    values: '4 9 42 15 10 _x5 0x10'
   promql_expr_test:
   - expr: time()
     eval_time: 1m
@@ -22,109 +22,61 @@ tests:
     exp_samples:
     - labels: '{}'
       value: 120
-  - expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod)
-    eval_time: 2m
-    exp_samples:
-    - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
-      value: 43
-    - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
-      value: 42
-  - expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod)
-    eval_time: 10m
+  - expr: time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})) by (job, pod)
+    eval_time: 5m
     exp_samples:
     - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
-      value: 0
-    - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
-      value: 0
-  - expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod)
-    eval_time: 11m
-    exp_samples:
-    - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
-      value: 0
+      value: 60
     - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
-      value: 0
-  - expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod)
-    eval_time: 10m
+      value: 60
+  - expr: time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})) by (job, pod)
+    eval_time: 6m
     exp_samples:
     - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
-      value: 600
+      value: 120
     - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
-      value: 600
-  - expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod)
-    eval_time: 11m
+      value: 120
+  - expr: time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})) by (job, pod)
+    eval_time: 7m
     exp_samples:
     - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
-      value: 660
+      value: 180
     - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
-      value: 660
-  - expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod) >= 600
-    eval_time: 12m
+      value: 180
+  - expr: time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"})) by (job, pod)
+    eval_time: 8m
     exp_samples:
     - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
-      value: 720
+      value: 240
     - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
-      value: 720
+      value: 240
   alert_rule_test:
-  - eval_time: 1m
-    alertname: ThanosSidecarUnhealthy
-  - eval_time: 2m
-    alertname: ThanosSidecarUnhealthy
-  - eval_time: 3m
-    alertname: ThanosSidecarUnhealthy
-  - eval_time: 10m
-    alertname: ThanosSidecarUnhealthy
-    exp_alerts:
-    - exp_labels:
-        severity: critical
-        job: thanos-sidecar
-        pod: thanos-sidecar-pod-0
-      exp_annotations:
-        description: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-0 is unhealthy for 600 seconds.'
-        runbook_url: 'https://github.com/thanos-io/thanos/tree/master/mixin/runbook.md#alert-name-thanossidecarunhealthy'
-        summary: 'Thanos Sidecar is unhealthy.'
-    - exp_labels:
-        severity: critical
-        job: thanos-sidecar
-        pod: thanos-sidecar-pod-1
-      exp_annotations:
-        description: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-1 is unhealthy for 600 seconds.'
-        runbook_url: 'https://github.com/thanos-io/thanos/tree/master/mixin/runbook.md#alert-name-thanossidecarunhealthy'
-        summary: 'Thanos Sidecar is unhealthy.'
-  - eval_time: 11m
-    alertname: ThanosSidecarUnhealthy
-    exp_alerts:
-    - exp_labels:
-        severity: critical
-        job: thanos-sidecar
-        pod: thanos-sidecar-pod-0
-      exp_annotations:
-        description: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-0 is unhealthy for 660 seconds.'
-        runbook_url: 'https://github.com/thanos-io/thanos/tree/master/mixin/runbook.md#alert-name-thanossidecarunhealthy'
-        summary: 'Thanos Sidecar is unhealthy.'
-    - exp_labels:
-        severity: critical
-        job: thanos-sidecar
-        pod: thanos-sidecar-pod-1
-      exp_annotations:
-        description: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-1 is unhealthy for 660 seconds.'
-        runbook_url: 'https://github.com/thanos-io/thanos/tree/master/mixin/runbook.md#alert-name-thanossidecarunhealthy'
-        summary: 'Thanos Sidecar is unhealthy.'
-  - eval_time: 12m
-    alertname: ThanosSidecarUnhealthy
-    exp_alerts:
-    - exp_labels:
-        severity: critical
-        job: thanos-sidecar
-        pod: thanos-sidecar-pod-0
-      exp_annotations:
-        description: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-0 is unhealthy for 720 seconds.'
-        runbook_url: 'https://github.com/thanos-io/thanos/tree/master/mixin/runbook.md#alert-name-thanossidecarunhealthy'
-        summary: 'Thanos Sidecar is unhealthy.'
-    - exp_labels:
-        severity: critical
-        job: thanos-sidecar
-        pod: thanos-sidecar-pod-1
-      exp_annotations:
-        description: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-1 is unhealthy for 720 seconds.'
-        runbook_url: 'https://github.com/thanos-io/thanos/tree/master/mixin/runbook.md#alert-name-thanossidecarunhealthy'
-        summary: 'Thanos Sidecar is unhealthy.'
+  - eval_time: 1m
+    alertname: ThanosSidecarUnhealthy
+  - eval_time: 2m
+    alertname: ThanosSidecarUnhealthy
+  - eval_time: 3m
+    alertname: ThanosSidecarUnhealthy
+  - eval_time: 5m
+    alertname: ThanosSidecarUnhealthy
+  - eval_time: 8m
+    alertname: ThanosSidecarUnhealthy
+    exp_alerts:
+    - exp_labels:
+        severity: critical
+        job: thanos-sidecar
+        pod: thanos-sidecar-pod-0
+      exp_annotations:
+        description: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-0 is unhealthy for more than 240 seconds.'
+        runbook_url: 'https://github.com/thanos-io/thanos/tree/master/mixin/runbook.md#alert-name-thanossidecarunhealthy'
+        summary: 'Thanos Sidecar is unhealthy.'
+    - exp_labels:
+        severity: critical
+        job: thanos-sidecar
+        pod: thanos-sidecar-pod-1
+      exp_annotations:
+        description: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-1 is unhealthy for more than 240 seconds.'
+        runbook_url: 'https://github.com/thanos-io/thanos/tree/master/mixin/runbook.md#alert-name-thanossidecarunhealthy'
+        summary: 'Thanos Sidecar is unhealthy.'
+  - eval_time: 10m
+    alertname: ThanosSidecarUnhealthy
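
The new `values` strings in tests.yaml use promtool's expanding notation, where `_` is a missing scrape, `_xn` repeats it n times, and `axn` yields the start value plus n further samples. As a reading aid only, the following Python sketch expands such a string; the `expand` helper is a hypothetical illustration, not promtool's parser, and it deliberately ignores the `stale` keyword.

```python
import re

# Matches 'a', 'axn', 'a+bxn' and 'a-bxn' (plain numbers only).
TOKEN = re.compile(r"(-?\d+(?:\.\d+)?)(?:([+-]\d+(?:\.\d+)?)?x(\d+))?")


def expand(values):
    """Expand a promtool-style series string; None marks a missing sample."""
    samples = []
    for token in values.split():
        if token == "_":
            samples.append(None)
        elif token.startswith("_x"):
            samples.extend([None] * int(token[2:]))
        else:
            match = TOKEN.fullmatch(token)
            if match is None:
                raise ValueError(f"unsupported token: {token!r}")
            start = float(match.group(1))
            step = float(match.group(2) or 0)
            repeats = int(match.group(3) or 0)
            # 'axn' means the start value followed by n further samples.
            samples.extend(start + i * step for i in range(repeats + 1))
    return samples


series = expand("5 10 43 17 11 _x5 0x10")
# With `interval: 1m`, index i is the sample scraped at minute i.
for minute, value in enumerate(series):
    print(f"{minute:2d}m -> {value}")
```

Read this way, the last real sample is at 4m, minutes 5 to 9 have no sample, and the metric reappears with value 0 at 10m. That matches the expectations in the test: the age grows from 60 seconds at 5m to 240 seconds at 8m, and no alert is expected at 10m.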
4 changes: 2 additions & 2 deletions mixin/alerts/sidecar.libsonnet
@@ -39,11 +39,11 @@
           {
             alert: 'ThanosSidecarUnhealthy',
             annotations: {
-              description: 'Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.',
+              description: 'Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for more than {{ $value }} seconds.',
               summary: 'Thanos Sidecar is unhealthy.',
             },
             expr: |||
-              time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{%(selector)s}) by (job, pod) >= 600
+              time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{%(selector)s})) by (job,pod) >= 240
             ||| % thanos.sidecar,
             labels: {
               severity: 'critical',
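
Jsonnet's `%` operator follows Python's `%` string formatting, so the way the mixin template turns into the rule in examples/alerts/alerts.yaml can be sketched in Python. The selector value below is an assumption inferred from the rendered rule in this commit; the real value comes from the mixin's configuration.

```python
# Template taken from the new expression in mixin/alerts/sidecar.libsonnet.
TEMPLATE = (
    "time() - max(timestamp("
    "thanos_sidecar_last_heartbeat_success_time_seconds{%(selector)s}"
    ")) by (job,pod) >= 240"
)

# Assumed stand-in for the `thanos.sidecar` object used by the mixin.
sidecar = {"selector": 'job=~"thanos-sidecar.*"'}

rendered = TEMPLATE % sidecar
print(rendered)
```

The printed expression is the one that appears in the generated alerts.yaml, which is a quick way to check that the generated examples and the mixin source agree.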
