Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add absent alerts #50

Merged
merged 5 commits into from
May 17, 2019
Merged

Add absent alerts #50

merged 5 commits into from
May 17, 2019

Conversation

devopsjonas
Copy link
Contributor

@devopsjonas devopsjonas commented Apr 24, 2019

This is part of moving away from https://github.com/devopyio/ceph-monitoring-mixin to this community repo :)

This is how generated alerting rules look:

"groups":
- "name": "ceph-mgr-status"
  "rules":
  - "alert": "CephMgrIsAbsent"
    "annotations":
      "description": "Ceph Manager has disappeared from Prometheus target discovery."
      "message": "Storage metrics collector service not available anymore."
      "severity_level": "warning"
      "storage_type": "ceph"
    "expr": |
      absent(up{job="rook-ceph-mgr"} == 1)
    "for": "5m"
    "labels":
      "severity": "warning"
  - "alert": "CephMgrIsMissingReplicas"
    "annotations":
      "description": "Ceph Manager is missing replicas."
      "message": "Storage metrics collector service not available anymore."
      "severity_level": "warning"
      "storage_type": "ceph"
    "expr": |
      sum(up{job="rook-ceph-mgr"}) != 3
    "for": "5m"
    "labels":
      "severity": "warning"

@anmolsachan
Copy link
Collaborator

@devopsjonas Thanks for the contribution. I was curious to know on how you tested your patches ?
And it would be great if you could add a screenshot.

@devopsjonas
Copy link
Contributor Author

devopsjonas commented May 3, 2019

@devopsjonas Thanks for the contribution. I was curious to know on how you tested your patches ?
And it would be great if you could add a screenshot.

@anmolsachan Sure we do it's in prod for one of our clients, I'll add a screenshot for Prometheus UI if that works for you?

Should I do it for all of my PRs and future PRs?

@anmolsachan
Copy link
Collaborator

Should I do it for all of my PRs and future PRs?

@devopsjonas That would be great

alerts/absent_alerts.libsonnet Outdated Show resolved Hide resolved
config.libsonnet Outdated
@@ -3,6 +3,22 @@
// Selectors are inserted between {} in Prometheus queries.
cephExporterSelector: 'job="rook-ceph-mgr"',

// Number of Ceph Managers which are reporting metrics
cephMgrCount: 3,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we always expecting 3 MGR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the alert checks if x number of replicas are running.

So cephMgrCount is a config option, where users set how many cephMgrs they are running. It is a config option, so users are expect to change it.

So should we change the default? If yes, what should the number be? Or drop alert entirely?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From ceph team I heard that one of the mons ceph-mgr runs?
@leseb does Rook provide and option to decide the no of manager pods?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want this config to be there. Lets have it as optional config. We can comment it out and if during the actual deployment the user can uncomment it. @shtripat @devopsjonas what do you think ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly it should not be default configuration I feel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, instead we should just do sum(up{%(cephExporterSelector)s}) < %(cephMgrCount)d and set it to 1 by default. So that becomes sum(up{job="ceph-mgr"}) < 1, which means if we are not scraping metrics we would get alerted. This would work for everyone. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the default to be 1 and alert condition to fire than it's less thant that.

config.libsonnet Outdated Show resolved Hide resolved
@devopsjonas
Copy link
Contributor Author

@shtripat @anmolsachan please take another look. Thank you!

@umangachapagain
Copy link
Collaborator

@devopsjonas Any specific reason why the alerts are in different groups and not in the same alerts group say ceph-mgr-status ?

@devopsjonas
Copy link
Contributor Author

@devopsjonas Any specific reason why the alerts are in different groups and not in the same alerts group say ceph-mgr-status ?

nice catch 🥇 fixed

@devopsjonas
Copy link
Contributor Author

@shtripat @anmolsachan please take another look 🙇‍♂️

Copy link
Contributor

@shtripat shtripat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@shtripat
Copy link
Contributor

Merging this now. Please add another PR for unit tests.

@shtripat shtripat merged commit aecf25d into ceph:master May 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants