Use prometheus conventions for workqueue metrics #71300

Merged: 2 commits, Jan 1, 2019

@danielqsj danielqsj commented Nov 21, 2018

What type of PR is this?
/kind feature
/sig api-machinery

What this PR does / why we need it:
Use prometheus conventions for workqueue metrics

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #71165

Special notes for your reviewer:
This patch does not remove the existing metrics but marks them as deprecated.
Users need 2 releases to convert their monitoring configuration.

Does this PR introduce a user-facing change?:

Use prometheus conventions for workqueue metrics.
It is now deprecated to use the following metrics:
* `{WorkQueueName}_depth`
* `{WorkQueueName}_adds`
* `{WorkQueueName}_queue_latency`
* `{WorkQueueName}_work_duration`
* `{WorkQueueName}_unfinished_work_seconds`
* `{WorkQueueName}_longest_running_processor_microseconds`
* `{WorkQueueName}_retries`
Please convert to the following metrics:
* `workqueue_depth`
* `workqueue_adds_total`
* `workqueue_queue_latency_seconds`
* `workqueue_work_duration_seconds`
* `workqueue_unfinished_work_seconds`
* `workqueue_longest_running_processor_seconds`
* `workqueue_retries_total`
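For anyone converting dashboards or alert rules, the rename listed above can be sketched as a small lookup. This is only an illustration: `newMetricName` is a hypothetical helper, and the release note does not say how the queue name is carried on the new metrics, so the sketch simply returns it separately.

```go
package main

import "strings"

// renamedSuffixes maps each deprecated per-queue suffix from the release note
// to its replacement metric name.
var renamedSuffixes = map[string]string{
	"_depth":                                  "workqueue_depth",
	"_adds":                                   "workqueue_adds_total",
	"_queue_latency":                          "workqueue_queue_latency_seconds",
	"_work_duration":                          "workqueue_work_duration_seconds",
	"_unfinished_work_seconds":                "workqueue_unfinished_work_seconds",
	"_longest_running_processor_microseconds": "workqueue_longest_running_processor_seconds",
	"_retries":                                "workqueue_retries_total",
}

// newMetricName (hypothetical helper) splits a deprecated
// "{WorkQueueName}_suffix" metric into the replacement metric name and the
// queue name. ok is false when no deprecated suffix matches.
func newMetricName(deprecated string) (metric, queue string, ok bool) {
	for suffix, replacement := range renamedSuffixes {
		if len(deprecated) > len(suffix) && strings.HasSuffix(deprecated, suffix) {
			return replacement, strings.TrimSuffix(deprecated, suffix), true
		}
	}
	return "", "", false
}
```

Note that `longest_running_processor` also changes unit from microseconds to seconds, so any thresholds on that metric need to be divided by 1e6 when converting.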

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 21, 2018
@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Nov 21, 2018
@danielqsj
Contributor Author

/assign @mortent

QueueLatencyKey = "queue_latency_microseconds"
WorkDurationKey = "work_duration_microseconds"
UnfinishedWorkKey = "unfinished_work_seconds"
LongestRunningProcessorKey = "longest_running_processor_microseconds"
Member

I think we should stick with one common unit for all the workqueue metrics and not mix seconds and microseconds. Seconds is one of the base units suggested in the Prometheus docs (https://prometheus.io/docs/practices/naming/), so I think we can use that unless we have a good reason to use microseconds.

Contributor Author

Agreed. Changed the unit to seconds. PTAL
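A minimal sketch of what reporting in seconds looks like in Go, assuming durations are measured with the standard `time` package. The helper names here are hypothetical and are not taken from the PR.

```go
package main

import "time"

// durationSeconds converts a time.Duration to a float64 number of seconds,
// the base unit suggested by the Prometheus naming guide.
func durationSeconds(d time.Duration) float64 {
	return d.Seconds()
}

// microsecondsToSeconds converts a value that was recorded by a
// microsecond-based metric into seconds.
func microsecondsToSeconds(us float64) float64 {
	return us / 1e6
}

// timeWork (hypothetical helper) runs work and returns how long it took,
// in seconds, ready to be observed by a seconds-based metric.
func timeWork(work func()) float64 {
	start := time.Now()
	work()
	return durationSeconds(time.Since(start))
}
```

Keeping the conversion in one place makes it harder to mix units between metrics by accident.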

@mortent
Member

mortent commented Nov 24, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 24, 2018
@jennybuckley

/cc @logicalhan

@k8s-ci-robot
Contributor

@jennybuckley: GitHub didn't allow me to request PR reviews from the following users: logicalhan.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @logicalhan

@mortent
Member

mortent commented Nov 29, 2018

/assign @smarterclayton

Member

@logicalhan logicalhan left a comment

I realize you did not create the files, but since you are touching rate_limitting_queue_test.go, would you mind renaming rate_limitting_queue.go and rate_limitting_queue_test.go? "Limitting" is a typo.

@danielqsj
Contributor Author

@logicalhan good catch, let's discuss it in #71683 or #71684.

@yue9944882
Member

/remove-sig api-machinery
/sig instrumentation

@k8s-ci-robot k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Dec 5, 2018
return adds
}

func (prometheusMetricsProvider) NewLatencyMetric(name string) workqueue.SummaryMetric {
Contributor

We should stop using Summary metrics; please use a Histogram instead.

Summary quantiles can't be aggregated across instances.
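To illustrate the aggregation point: histogram buckets are plain counters, so buckets with the same upper bound can be summed across instances, whereas precomputed quantiles cannot be meaningfully added or averaged. The sketch below uses a simplified, hypothetical `histogram` type rather than the Prometheus client types.

```go
package main

// histogram is a simplified stand-in for one instance's exported histogram.
type histogram struct {
	upperBounds []float64 // ascending bucket upper bounds, in seconds
	cumulative  []uint64  // cumulative[i] counts samples <= upperBounds[i]
	count       uint64    // all samples, including those above the last bound
}

// merge (hypothetical helper) sums two histograms bucket by bucket.
// It reports false when the bucket layouts differ, because counts are only
// comparable between identical upper bounds.
func merge(a, b histogram) (histogram, bool) {
	if len(a.upperBounds) != len(b.upperBounds) ||
		len(a.cumulative) != len(a.upperBounds) ||
		len(b.cumulative) != len(b.upperBounds) {
		return histogram{}, false
	}
	out := histogram{
		upperBounds: append([]float64(nil), a.upperBounds...),
		cumulative:  make([]uint64, len(a.cumulative)),
		count:       a.count + b.count,
	}
	for i := range a.upperBounds {
		if a.upperBounds[i] != b.upperBounds[i] {
			return histogram{}, false
		}
		out.cumulative[i] = a.cumulative[i] + b.cumulative[i]
	}
	return out, true
}
```

A quantile estimated from the merged buckets then reflects all instances together, which is not possible with per-instance Summary quantiles.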

Contributor Author

@loburm which buckets do you prefer, or should we just ignore this for now?

Contributor

I'm not familiar with these queues. I remember that the default buckets are, for example, almost useless for kube-apiserver request latency, because most samples fall into the first few buckets, which doesn't give enough information to measure it.

I usually prefer around 20 reasonable buckets. Let's ask someone from sig-instrumentation for advice.

@DirectXMan12 @brancz

Member

As this is on internal queues, the latencies should be rather small. I'd suggest something along the lines of:

prometheus.ExponentialBuckets(10e-9, 10, 10)

That gives us exponential buckets with upper bounds from 10 nanoseconds to 10 seconds.
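To see which boundaries that call produces, here is a dependency-free sketch that mirrors the arithmetic of `prometheus.ExponentialBuckets` (the local `exponentialBuckets` function is a stand-in, not the library code). Since `10e-9` equals `1e-8`, the first upper bound is 10 ns and the tenth is 10 s.

```go
package main

// exponentialBuckets returns count upper bounds, starting at start, where
// each bound is factor times the previous one. It returns nil for arguments
// that cannot produce an ascending series.
func exponentialBuckets(start, factor float64, count int) []float64 {
	if count < 1 || start <= 0 || factor <= 1 {
		return nil
	}
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}
```

Calling `exponentialBuckets(10e-9, 10, 10)` yields ten bounds, one per decade, so every sample between 10 ns and 10 s lands in a bucket one order of magnitude wide.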

Contributor

Not sure that's the best approach. I would check the current values from a few kube-apiservers and select the range based on them.

Member

@brancz brancz Dec 6, 2018

If there are histograms for queues in the apiserver, then yes, we should be consistent. Latency histograms for API requests (as in a service that performs network requests) are very different from those for queues, though. Queues should be substantially faster.

Contributor Author

@loburm have you reached a conclusion about the buckets?

Contributor

Do you have any data about the current distribution of those samples? But if you are happy with these buckets:
1ns - 10ns
10ns - 100ns
...
1s - 10s
then you can go ahead with that proposal.

Contributor Author

After checking current samples, I agree with your proposal.
Code fixed. PTAL @loburm @brancz
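One way to check samples against a proposed bucket layout, sketched here as an assumption about how such a check could be done (the thread does not say how the samples were inspected). `bucketCounts` is a hypothetical helper that tallies how many samples fall into each bucket, using inclusive upper bounds as Prometheus histograms do.

```go
package main

import "sort"

// bucketCounts returns one count per bucket for the given ascending upper
// bounds, plus a final entry for samples above the last bound (the implicit
// +Inf bucket). Counts are per bucket, not cumulative.
func bucketCounts(upperBounds, samples []float64) []int {
	counts := make([]int, len(upperBounds)+1)
	for _, s := range samples {
		// SearchFloat64s finds the first bound >= s, which is the bucket
		// whose inclusive upper bound covers the sample.
		i := sort.SearchFloat64s(upperBounds, s)
		counts[i]++
	}
	return counts
}
```

If nearly all counts pile up in one or two entries, the layout is too coarse in that range and the bounds are worth revisiting.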

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 7, 2018
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 12, 2018
@loburm
Contributor

loburm commented Dec 12, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 12, 2018
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 12, 2018
@danielqsj
Contributor Author

@loburm @mortent @brancz after formatting the code, this needs a new LGTM, thanks

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 12, 2018
Contributor

@loburm loburm left a comment

/lgtm

@brancz
Member

brancz commented Dec 14, 2018

/lgtm
/approve

Could you also do a PR to add this to the metrics overhaul KEP? I just want to make sure we keep everything documented in one place.

@smarterclayton
Contributor

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brancz, danielqsj, smarterclayton

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 14, 2018
@ash2k
Member

ash2k commented Jan 1, 2019

/test pull-kubernetes-godeps

@brancz
Member

brancz commented Jan 10, 2019

@danielqsj can you make sure to create a follow up for this one to add the deprecation notice in the help text for the metrics deprecated in this PR as well? Thanks!

@danielqsj
Contributor Author

@brancz yes, PR #72679 aims to mark these metrics as deprecated.
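A deprecation notice in the help text could be as simple as a fixed prefix. The sketch below is an assumption for illustration: the `(Deprecated) ` prefix and the `deprecatedHelp` helper are hypothetical and are not quoted from #72679.

```go
package main

import "strings"

// deprecatedPrefix is an assumed marker, chosen for this example only.
const deprecatedPrefix = "(Deprecated) "

// deprecatedHelp (hypothetical helper) prepends the marker to a metric's help
// text, leaving text that is already marked unchanged.
func deprecatedHelp(help string) string {
	if strings.HasPrefix(help, deprecatedPrefix) {
		return help
	}
	return deprecatedPrefix + help
}
```

Because help text is exported alongside each metric, a marker like this is visible to anyone scraping or browsing the metrics endpoint.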

@MikeSpreitzer
Member

YAY on the change from Summary to Histogram. (Although, I would have liked a little finer granularity in the buckets, or control over the buckets.)
