Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Support For Prometheus Metrics #348

Closed
seanmalloy opened this issue Jul 15, 2020 · 16 comments · Fixed by #505
Closed

Add Support For Prometheus Metrics #348

seanmalloy opened this issue Jul 15, 2020 · 16 comments · Fixed by #505
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature.

Comments

@seanmalloy
Copy link
Member

seanmalloy commented Jul 15, 2020

Is your feature request related to a problem? Please describe.

I'd like to monitor the descheduler using Prometheus metrics. Metrics should only be collected when run without the --dry-run CLI option.

Describe the solution you'd like

Add an optional configuration option(CLI option?) to enable exposing Prometheus metrics. At this point in time I have not given much thought to exactly which metrics and labels the descheduler would expose.

Describe alternatives you've considered

I have considered using the descheduler logs with log aggregation software for monitoring the descheduler. I also considered using the k8s events generated by the descheduler in combination with eventrouter. The eventrouter does provide some simple Prometheus metrics for events that it collects.

What version of descheduler are you using?

descheduler version: v0.18.0

Additional context

Currently the descheduler is run as a Job or CronJob this makes it difficult for Prometheus to scrape metrics from the descheduler pod because it is not long lived. One option is to have descheduler push metrics using the Prometheus pushgateway.

@seanmalloy seanmalloy added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 15, 2020
@damemi
Copy link
Contributor

damemi commented Jul 27, 2020

Currently the descheduler is run as a Job or CronJob this makes it difficult for Prometheus to scrape metrics from the descheduler pod because it is not long lived

The scheduler can run as a long lived pod with the --descheduling-interval flag. This is actually what we chose to prefer in the OpenShift Descheduler Operator because it also allowed us to not yet depend on CronJobs (which are still beta)

@farah
Copy link
Contributor

farah commented Aug 31, 2020

/assign

@ingvagabund
Copy link
Contributor

@seanmalloy any particular metrics in mind? I can quickly think of:

  • number of pods evicted per each strategy
  • number of pods failed to be evicted per each strategy

@seanmalloy
Copy link
Member Author

@seanmalloy any particular metrics in mind? I can quickly think of:

  • number of pods evicted per each strategy
  • number of pods failed to be evicted per each strategy

@ingvagabund yep that is pretty much it. See below for some additional details. I'm open to suggestions.

Metrics

name type description
build_info gauge constant 1
pods_evicted_success counter total number of pods successfully evicted
pods_evicted_failed counter total number of pods failed eviction

The metric names can be changed. There is probably some k8s best practice for naming Prometheus metrics. We should follow those guidelines to determine the names of the metrics.

Labels

The pods_evicted_success and pods_evicted_failed metrics should have these labels.

  • namespace
  • descheduler strategy name

The build_info metric should have these labels.

  • Go version
  • Descheduler version
  • Git SHA1
  • Git Branch

@seanmalloy
Copy link
Member Author

One thing I forgot...

We need a separate metric or maybe a label to differentiate when using --dry-run CLI option and not using the --dry-run CLI option. I'd like to be able to collect metrics and see what descheduler would do without evicting any pods.

@ingvagabund
Copy link
Contributor

With --dry-run set you will be evicting the same pods all over again. Will the metrics be still useful for you (e.g. with the same pod counted 1000 times)

@seanmalloy
Copy link
Member Author

With --dry-run set you will be evicting the same pods all over again. Will the metrics be still useful for you (e.g. with the same pod counted 1000 times)

@ingvagabund after thinking about it I believe you are correct. I edited the original description to remove the requirement of collecting metrics when the --dry-run CLI option is used.

@seanmalloy
Copy link
Member Author

@farah I see that this issue is currently assigned to you. I'm just checking in to see if you have any questions or need any assistance. We are trying to complete this feature as part of the descheduler v0.20.0 release which should be sometime in December.

Thanks!

@eatwithforks
Copy link
Contributor

👍 for this feature. @seanmalloy I see v0.20.0 was released 6 days ago but didn't see mentions of prometheus metrics collection in the changelog. Is this feature still in the works and do you have an estimate on ETA?

@seanmalloy
Copy link
Member Author

👍 for this feature. @seanmalloy I see v0.20.0 was released 6 days ago but didn't see mentions of prometheus metrics collection in the changelog. Is this feature still in the works and do you have an estimate on ETA?

This feature was not implemented for v0.20.0. Maybe for the v0.21.0 release. For sure I'm hoping sometime in 2021.

@damemi
Copy link
Contributor

damemi commented Dec 16, 2020

Is there currently any PRs open for this? The last activity was a couple months ago so it may be safe to assume this is open for anyone interested in contributing!

@seanmalloy
Copy link
Member Author

Is there currently any PRs open for this? The last activity was a couple months ago so it may be safe to assume this is open for anyone interested in contributing!

Correct. Basically we need someone to volunteer to write the code to implement this feature. It's on my extended todo list, but I don't have time to do this right now. So not assigning myself at the moment.

/help

@k8s-ci-robot
Copy link
Contributor

@seanmalloy:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

Is there currently any PRs open for this? The last activity was a couple months ago so it may be safe to assume this is open for anyone interested in contributing!

Correct. Basically we need someone to volunteer to write the code to implement this feature. It's on my extended todo list, but I don't have time to do this right now. So not assigning myself at the moment.

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Dec 17, 2020
@damemi
Copy link
Contributor

damemi commented Feb 19, 2021

A good first step for implementation would be adding some of these metrics to podEvictor (#503). I'll have some time to work on this, unless someone else is interested

@ingvagabund
Copy link
Contributor

WIP PR for discussion: #505

@seanmalloy
Copy link
Member Author

This feature will ship with descheduler v0.21.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants