Why is adding new attributes to a metric a breaking change? #722
Is there a clear rationale documented somewhere for why adding new attributes to a metric is a breaking change?

This PR comment is one of the earliest assertions that I'm aware of that adding attributes is breaking, but it has remained unclear to me whether folks have a shared understanding of why.

I do not personally have experience with a backend where this is true, but if it is true for some backend, I think it would be useful to articulate this somewhere.

Comments
I agree it would be great to document the rationale since this has been surprising to many people. In the meantime, I think the best we have is @jsuereth's document.
It would be great to clarify that. If we applied that rule strictly, wouldn't we need to say that instrumentations MUST NOT add any additional attributes to metrics that are not covered by the specification? For example, if ASP.NET Core 8 adds an additional attribute to …
That's the exact wording from the spec:
@pyohannes E.g. when my application goes from no routing to routing, or when my application that never reported any errors (lucky!) starts to report some errors. I.e. breaking (that we want to prevent) happens when an existing time series splits, but not when new data that has never been seen before comes with new dimensions.
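To sketch the difference with Prometheus-style series (the metric and label names here are illustrative assumptions, roughly what a Prometheus exporter would produce for http.server.request.duration):

```promql
# New data with new dimensions (not breaking): the app starts returning errors for the
# first time, so a brand-new series appears while the existing series keep receiving data.
http_server_request_duration_seconds_count{http_route="/users", http_response_status_code="200"}                       # keeps growing
http_server_request_duration_seconds_count{http_route="/users", http_response_status_code="500", error_type="timeout"} # brand new

# A split of an existing series (breaking): an instrumentation upgrade starts attaching
# error.type to responses it previously recorded without it, so new samples land in a
# different series and the original one stops receiving data.
http_server_request_duration_seconds_count{http_route="/users", http_response_status_code="500"}                       # goes stale
http_server_request_duration_seconds_count{http_route="/users", http_response_status_code="500", error_type="timeout"} # replaces it
```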
What if I switch from .NET 6 to .NET 8, and …?
Reading the description of that metric:
I'm inclined to say it's not a breaking change, but I may be wrong as I haven't looked at how it will be booked. Where I think adding a new attribute becomes a problem is what Josh described in his document (the one @trask linked above): imagine you have an alert based on a metric that is booked by an instrumentation with two attributes.
Then the instrumentation library adds a new attribute at the same code line where it used to book only the two attributes. Now we have two time series: the first with 2 attributes and the new one with 3. The problem, and why it is breaking, is that the library will not book data into the first time series anymore (the one with only 2 attributes); it will only book into the new time series with 3 attributes. This is when my alert breaks, because it won't get any more data.
One way to "fix" this problem and make adding new attributes backwards compatible is for instrumentations to keep booking a measurement with the previous attribute set and book another one with the new set.
Thanks @joaopgrassi, I understand better now.
This would need a completely different metric name, otherwise you'll get wrong values for aggregations of time series.
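To spell out the aggregation problem with an assumed Prometheus-style metric name: if instrumentations booked every request twice on the same metric, once with the old attribute set and once with the new one, any aggregate over that metric would double-count.

```promql
# With each request recorded under both attribute sets of the same metric,
# this "total request rate" would report twice the real traffic:
sum(rate(http_server_request_duration_seconds_count[5m]))
```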
Understanding the problem a bit better now, I realize this can possibly break for any backend that supports proper PromQL. Given you have a metric and a query over it that worked before, introducing an additional attribute on that metric can change what the query returns. Consequently, this is an issue at least for backends supporting PromQL, which means it isn't just a niche problem.
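As one illustrative example (not the one from the comment above; metric and label names are assumptions) of a PromQL expression whose results change when an attribute is added:

```promql
# A per-route p95 written with `without` rather than an explicit `by` list:
histogram_quantile(
  0.95,
  sum without (http_response_status_code) (
    rate(http_server_request_duration_seconds_bucket[5m])
  )
)

# Today this yields one result series per route. Once the instrumentation also attaches
# error.type, the inner sum keeps that new label, so the same expression suddenly yields
# several series per route, and anything that joins this result 1:1 with another vector
# (default vector matching) stops matching.
```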
Should we migrate this rationale description into some kind of guidance in semconv? I'd love to not repeat this discussion every time someone unfamiliar with Prometheus/PromQL (and its ilk) joins semconv. Let me know where, and we can migrate the Google doc @trask linked to there.
It could be, but I think the otel-dotnet folks made some effort to make the .NET 6 and .NET 8 metrics consistent before the HTTP instrumentation went stable. I think the breaking changes we should focus on in semconv are instrumentation library updates where the user didn't do anything in their code that would provoke them. E.g. if I switch from one messaging system to another, I will need to update my dashboards and alerts. Or if I start using a new API that emits additional attributes on the same metric (e.g. I used the messaging receive API and now I use the processing API). I don't see how we can prevent breaking changes like these.
💯% A good place for it is https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/metrics.md
@lmolkova touched on an example (#722 (comment)) that I want to dig into a bit deeper because I'm still confused 😕 even after reading @jsuereth's doc and the discussion here...
Making this a bit more concrete with some pseudocode, I'm confident that the community has shipped stable HTTP instrumentation across a number of languages that looks something like the following:

func RecordResponseTime(duration, httpRoute, statusCode, errorType) {
var histogram = meter.CreateHistogram("http.server.request.duration");
if (statusCode < 500) {
histogram.Record(duration, {
"http.route" = httpRoute,
"http.response.status_code" = statusCode
})
} else {
histogram.Record(duration, {
"http.route" = httpRoute,
"http.response.status_code" = statusCode,
"error.type" = errorType
});
}
}

Say my application has been running flawlessly for 10 days, and then on the 11th day some percentage of requests start to experience errors. Both this example and the scenario of upgrading instrumentation that adds an attribute to an existing metric seem to share the same risk: don't they both have the potential of breaking my dashboards and alerts when using Prometheus/PromQL, because there are now at least two time series?

Back in the day, before OpenTelemetry, if I was instrumenting my code with a Prometheus client library, would I have been expected to write my instrumentation ensuring the metric always recorded the same set of attributes? Something like:

func RecordResponseTime(duration, httpRoute, statusCode, errorType) {
var histogram = meter.CreateHistogram("http.server.request.duration");
if (errorType == null) {
errorType = "unset";
}
histogram.Record(duration, {
"http.route" = httpRoute,
"http.response.status_code" = statusCode,
"error.type" = errorType
});
}

It's true I'm not an expert with Prometheus/PromQL, but I'm surprised this is a problem that needs to be addressed by instrumentation authors rather than alert/dashboard authors. Is it difficult or unusual to write alerts that are not brittle using PromQL? For example, if I wanted to alert when overall request latency exceeds a threshold, it seems like it would be a bug not to craft the alert to aggregate all time series into one. I have to imagine I'm still missing something...
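For concreteness, the kind of aggregate-everything alert I have in mind might look like this (PromQL with an assumed metric name and an arbitrary threshold):

```promql
# Overall p95 request latency across all routes, status codes, and any attribute that
# may be added later; new labels are aggregated away, so they cannot break this alert:
histogram_quantile(
  0.95,
  sum by (le) (rate(http_server_request_duration_seconds_bucket[5m]))
) > 0.5
```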
This is more about a contract between instrumentation authors and alert/dashboard authors; the question is whether instrumentations provide a fixed or an extensible set of metric attributes.
I would say no. The idea of defining an alert that doesn't aggregate on a fixed set of attributes hadn't occurred to me before this discussion, but I'm not sure this can be generalized. It's certainly possible to do it, and also to define alerts that are "brittle" in that sense.
The assumption is that users should write alerts/build dashboards based on the metric definition rather than the actual data.
I.e., just based on the md/yaml files, before any actual data flows into my backend.
So, actually, absent labels in Prometheus are modeled as empty labels: in your example, setting "error.type" to "unset" is achieved by just not passing error.type at all. Every metric point effectively "fills out" missing labels with empty strings in the data model. As Liudmila points out, always grouping (…)
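Two illustrative queries (assumed Prometheus-style names) for both points:

```promql
# An absent label and an empty label are the same series in Prometheus's data model,
# so this selects requests that recorded no error type (no "unset" sentinel needed):
rate(http_server_request_duration_seconds_count{error_type=""}[5m])

# Queries that aggregate over an explicit label set stay stable when new attributes
# appear later: anything not listed in by(...) is simply summed away.
sum by (http_route) (rate(http_server_request_duration_seconds_count[5m]))
```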
We discussed this again today in the messaging workgroup, and we agreed it would be great to add some reasoning around this to the semantic conventions.