-
Notifications
You must be signed in to change notification settings - Fork 39.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve volume operation metrics #75750
Conversation
/assign @gnufied |
pkg/volume/util/metrics.go
Outdated
}, | ||
[]string{"volume_plugin", "operation_name"}, | ||
[]string{"volume_plugin", "operation_name", "status"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thought I had was also having an "error code" dimension. Then we could potentially have queries to exclude certain types of errors.
But I'm not sure if we can get that level of granularity from the volume operations, which often just do "fmt.Errorf" without any error code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can export "status" or "status_code", with some value representing success (e.g. "OK" or "success" as you have now) and treat everything else as an error. In this PR you could label all errors as e.g. "unknown" and do some refactoring afterwards to surface particular error types without changing the metric definition.
I'm against having both boolean "success"/"failure" AND some error code in another label though.
/retest |
/hold |
pkg/volume/util/metrics.go
Outdated
} | ||
storageOperationMetric.WithLabelValues(plugin, operationName, status).Observe(timeTaken) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this cause existing dashboards to break because now suddenly this metric will include errors as well. Do we need to capture this in a changelog or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a release note with an action required.
Dashboards shouldn't break however now that the metrics are capturing more cases than before, the semantics of the values may change. So any alerts based on this metric may need to be adjusted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some further thought, I'm going to leave the latency histogram as is (only counting successes), and adding a new "status_count" metric to track number of success and errors
pkg/volume/util/metrics.go
Outdated
}, | ||
[]string{"volume_plugin", "operation_name"}, | ||
[]string{"volume_plugin", "operation_name", "status_code"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is status_code
really appropriate here? When I read that I expect http code-like values, but here we're just getting "success"/"unknown", right? In that case I would either expect this label to be status
or result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"status" sounds good to me
d973119
to
53f17ec
Compare
Looks good from sig-instrumentation side. /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: msau42 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Extended the existing e2e test for success counting and added a new e2e test for the failure counting. @gnufied ptal |
/hold cancel |
updatedStorageMetrics := getControllerStorageMetrics(updatedControllerMetrics) | ||
|
||
Expect(len(updatedStorageMetrics.statusMetrics)).ToNot(Equal(0), "Error fetching c-m updated storage metrics") | ||
verifyMetricCount(storageOpMetrics, updatedStorageMetrics, "volume_provision", true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this fail at provision time itself? Do we have to create pod and stuff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to create the Pod to handle delayed binding cases, where provisioning is not triggered until a Pod is created.
/lgtm |
…-upstream-release-1.14 Automated cherry pick of #75750: Improve volume operation metrics
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: