Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

leases: Add metrics to etcd leases #9778

Merged
merged 1 commit into from
May 25, 2018

Conversation

idiamond-stripe
Copy link
Contributor

This patch adds four metrics (leases granted, leases revoked, leases renewed, and the average TTL of leases) to the leases package for easier debugging and visibility into outstanding leases.

There's not currently a way to monitor the number of leases that are granted or revoked within etcd. It's currently possible to see the number of leases expired via server_lease_expired_total, but not the number initially granted nor those being renewed.

This makes it tough to determine the current number of leases held, and determine why leases are not being cleaned up. This patch makes the behavior described in, for example, #9395 easier to debug and understand as it makes it clear that leases are never being revoked, rather than renewed forever.

Please read https://github.com/coreos/etcd/blob/master/CONTRIBUTING.md#contribution-flow.

@idiamond-stripe
Copy link
Contributor Author

Given #9764, I'm not sure whether these should live in the etcd_debugging namespace or the etcd namespace.

@gyuho
Copy link
Contributor

gyuho commented May 25, 2018

We will decide whether to keep debugging namespace in our next community meeting https://github.com/coreos/etcd#community-meetings.

@codecov-io
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (master@f5e52c9). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #9778   +/-   ##
=========================================
  Coverage          ?   69.54%           
=========================================
  Files             ?      376           
  Lines             ?    35215           
  Branches          ?        0           
=========================================
  Hits              ?    24489           
  Misses            ?     8962           
  Partials          ?     1764
Impacted Files Coverage Δ
lease/lessor.go 86.75% <100%> (ø)
lease/metrics.go 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f5e52c9...6e06d19. Read the comment docs.

@xiang90
Copy link
Contributor

xiang90 commented May 25, 2018

@gyuho I am not a fan of many metrics. I hope normal users can only need to pay attention to a small set of well thought metrics. Other metrics are for debugging.

@gyuho
Copy link
Contributor

gyuho commented May 25, 2018

@xiang90 Agree. I will go through all metrics, and select a few important ones (e.g. DB size in use).

lease/metrics.go Outdated
Namespace: "etcd_debugging",
Subsystem: "lease",
Name: "renewed_total",
Help: "The total number of renewed leases.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lease renew is not a cluster wide operation, it only goes through the leader node at least for now. needs to add some clarification here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update this.

@xiang90
Copy link
Contributor

xiang90 commented May 25, 2018

lgtm

lease/metrics.go Outdated
@@ -0,0 +1,59 @@
// Copyright 2015 The etcd Authors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/2015/2018/?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update this.

lease/metrics.go Outdated
Name: "total_ttl",
Help: "Bucketed histogram of lease TTLs.",
// 1 second -> 1 month
Buckets: prometheus.ExponentialBuckets(1, 2, 24),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pow(2,23) * 1 == 8388608 seconds == 97 days?

Do we really need this high upper bound?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leases can be up to 9E9 seconds, so setting this fairly high seemed reasonable, but I'm happy to decrease it if you'd prefer!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, can you update the comment instead? 1 second -> 3 months?

lease/metrics.go Outdated
prometheus.HistogramOpts{
Namespace: "etcd_debugging",
Subsystem: "lease",
Name: "total_ttl",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also rename to ttl_total to be consistent with other metrics.

Copy link
Contributor Author

@idiamond-stripe idiamond-stripe May 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok!

This patch adds four metrics to the `leases` package for easier
debugging.
Copy link
Contributor

@gyuho gyuho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @idiamond-stripe

@gyuho gyuho merged commit ba10640 into etcd-io:master May 25, 2018
bdarnell added a commit to cockroachdb/vendored that referenced this pull request Jun 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants