Add absent alerts #50

devopsjonas · 2019-04-24T04:59:16Z

This is part of moving away from https://github.com/devopyio/ceph-monitoring-mixin to this community repo :)

This is how generated alerting rules look:

"groups":
- "name": "ceph-mgr-status"
  "rules":
  - "alert": "CephMgrIsAbsent"
    "annotations":
      "description": "Ceph Manager has disappeared from Prometheus target discovery."
      "message": "Storage metrics collector service not available anymore."
      "severity_level": "warning"
      "storage_type": "ceph"
    "expr": |
      absent(up{job="rook-ceph-mgr"} == 1)
    "for": "5m"
    "labels":
      "severity": "warning"
  - "alert": "CephMgrIsMissingReplicas"
    "annotations":
      "description": "Ceph Manager is missing replicas."
      "message": "Storage metrics collector service not available anymore."
      "severity_level": "warning"
      "storage_type": "ceph"
    "expr": |
      sum(up{job="rook-ceph-mgr"}) != 3
    "for": "5m"
    "labels":
      "severity": "warning"

anmolsachan · 2019-05-02T09:25:45Z

@devopsjonas Thanks for the contribution. I was curious how you tested your patches.
It would be great if you could add a screenshot.

alerts/absent_alerts.libsonnet

devopsjonas · 2019-05-03T05:53:57Z

> @devopsjonas Thanks for the contribution. I was curious how you tested your patches.
> It would be great if you could add a screenshot.

@anmolsachan Sure. It's running in prod for one of our clients. I can add a screenshot of the Prometheus UI, if that works for you?

Should I do it for all of my PRs and future PRs?

anmolsachan · 2019-05-03T11:59:27Z

> Should I do it for all of my PRs and future PRs?

@devopsjonas That would be great

alerts/absent_alerts.libsonnet

shtripat · 2019-05-03T12:04:55Z

config.libsonnet

@@ -3,6 +3,22 @@
    // Selectors are inserted between {} in Prometheus queries.
    cephExporterSelector: 'job="rook-ceph-mgr"',

+    // Number of Ceph Managers which are reporting metrics
+    cephMgrCount: 3,


Are we always expecting 3 MGRs?

devopsjonas · 2019-05-04

The alert checks whether the expected number of replicas is running.

cephMgrCount is a config option where users set how many Ceph Managers they are running, so users are expected to change it.

Should we change the default? If yes, what should the number be? Or should we drop the alert entirely?
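As a rough illustration of "users are expected to change it", a deployment running two managers might override the option like this. This is only a sketch: the `_config` field name follows the usual monitoring-mixin convention and is assumed here, `mixin` is a local stand-in for the real config.libsonnet, and the `!=` template is inferred from the generated rule shown earlier rather than copied from the PR.

```jsonnet
// Stand-in for the mixin's config.libsonnet (assumed layout, not the real file).
local mixin = {
  _config+:: {
    // Selectors are inserted between {} in Prometheus queries.
    cephExporterSelector: 'job="rook-ceph-mgr"',
    // Number of Ceph Managers which are reporting metrics.
    cephMgrCount: 3,
  },
};

// A deployment with two managers overrides only the count.
local custom = mixin {
  _config+:: {
    cephMgrCount: 2,
  },
};

{
  // Assumed template for the CephMgrIsMissingReplicas expression.
  expr: 'sum(up{%(cephExporterSelector)s}) != %(cephMgrCount)d' % custom._config,
}
```

Under these assumptions the rendered expression becomes `sum(up{job="rook-ceph-mgr"}) != 2`, so only the count changes and the selector is inherited.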

shtripat · 2019-05-06

I heard from the Ceph team that ceph-mgr runs on one of the mons?
@leseb does Rook provide an option to set the number of manager pods?

anmolsachan · 2019-05-09

If we want to keep this config, let's make it optional. We can comment it out, and the user can uncomment it during the actual deployment. @shtripat @devopsjonas what do you think?

shtripat · 2019-05-09

I feel it certainly should not be part of the default configuration.

devopsjonas · 2019-05-11

Maybe we should instead use sum(up{%(cephExporterSelector)s}) < %(cephMgrCount)d and set cephMgrCount to 1 by default. That becomes sum(up{job="ceph-mgr"}) < 1, which means we would get alerted if we are not scraping metrics. This would work for everyone. WDYT?

devopsjonas · 2019-05-15

I've changed the default to 1 and the alert condition to fire when the count is less than that.
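For reference, the revised rule could be sketched roughly as follows. This is an approximation built from the discussion, not the merged file: `config` is a local stand-in for the mixin's config, the `for` duration is left as a literal because the name of the configurable option is not shown in this thread, and the annotations are trimmed.

```jsonnet
// Local stand-in for the mixin config, using the new default of 1.
local config = {
  cephExporterSelector: 'job="rook-ceph-mgr"',
  cephMgrCount: 1,
};

{
  groups: [
    {
      name: 'ceph-mgr-status',
      rules: [
        {
          alert: 'CephMgrIsMissingReplicas',
          // Fires when fewer managers than expected report up == 1.
          expr: 'sum(up{%(cephExporterSelector)s}) < %(cephMgrCount)d' % config,
          // 'for' is a jsonnet keyword, so the field name must be quoted.
          'for': '5m',
          labels: { severity: 'warning' },
          annotations: {
            description: 'Ceph Manager is missing replicas.',
          },
        },
      ],
    },
  ],
}
```

Note that in PromQL, `sum()` over an empty instant vector returns an empty result, so this expression does not fire when the target has vanished from discovery altogether. That case is left to CephMgrIsAbsent, which uses `absent()`.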

config.libsonnet

devopsjonas · 2019-05-04T05:50:39Z

@shtripat @anmolsachan please take another look. Thank you!

umangachapagain · 2019-05-09T08:03:58Z

@devopsjonas Any specific reason why the alerts are in different groups rather than in a single group, say ceph-mgr-status?

devopsjonas · 2019-05-11T05:08:37Z

> @devopsjonas Any specific reason why the alerts are in different groups rather than in a single group, say ceph-mgr-status?

nice catch 🥇 fixed

alerts/absent_alerts.libsonnet

devopsjonas · 2019-05-15T14:41:43Z

@shtripat @anmolsachan please take another look 🙇‍♂️

shtripat

LGTM.

shtripat · 2019-05-17T06:13:24Z

Merging this now. Please add another PR for unit tests.

anmolsachan reviewed May 2, 2019

shtripat reviewed May 3, 2019

devopsjonas added 4 commits May 11, 2019 08:35

2396405 Add absent alerts
8a89011 Fixes after review
85014a6 Make alert time configurable
fe2ca9d Move alerts into ceph-mgr-status group

shtripat reviewed May 15, 2019

shtripat mentioned this pull request May 15, 2019

Added alert for absent rook-ceph-mgr service #57

Closed

501fb48 Fixes after review

devopsjonas force-pushed the add-absent branch from 2df70a9 to 501fb48 Compare May 15, 2019 14:39

shtripat approved these changes May 16, 2019

shtripat merged commit aecf25d into ceph:master May 17, 2019
