Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Summary to the Aggregation in the SDK specifications #2704

Closed
asafm opened this issue Aug 1, 2022 · 9 comments
Closed

Add Summary to the Aggregation in the SDK specifications #2704

asafm opened this issue Aug 1, 2022 · 9 comments
Assignees
Labels
[label deprecated] triaged-rejected [label deprecated] The issue is triaged and rejected by the OTel community spec:metrics Related to the specification/metrics directory

Comments

@asafm
Copy link
Contributor

asafm commented Aug 1, 2022

What are you trying to achieve?
I would like the API Specifications for Metrics to support the Summary data type

Additional context.

I saw that the Metrics Data Model and the Open Telemetry Protocol support Summar (appears here under Metrics points -> Summary (Legacy).
Then I tried using it in the Java SDK and I couldn't find Summary. I did find an issue that stated for the SDK to support Summary the API Specification must support it first. This is the reason I opened this issue.

I'm working on a big epic in Apache Pulsar which refactors the whole metrics libraries. One library that came up was Open Telemetry. The only "small" problem is that Apache Pulsar uses Summary objects heavily, and can make two really big changes at once, so it made sense that if the Open Telemetry protocol itself supports Summary, why not the API and Java SDK would support it as well, making this migration possible.

@asafm asafm added the spec:metrics Related to the specification/metrics directory label Aug 1, 2022
@jack-berg
Copy link
Member

The java SDK data model and its OTLP metric exporters do support summaries. This means that you can implement the MetricData interface with a type of MetricDataType.SUMMARY. And you can pass MetricData to MetricExporter#export(Collection) implementations like OtlpGrpcMetricExporter to export the summaries out of process over OTLP.

It's only the metrics API that doesn't support summaries. The API what you would use in place of whatever is recording the individual measurements that are aggregated into your current summaries. It seems unlikely to me that the API would change to have some purpose built facility to accommodate summaries, since histograms are intended to fill that type of use case.

@jmacd
Copy link
Contributor

jmacd commented Aug 1, 2022

I believe users should use Histogram instruments configured with Summary aggregators to achieve the desired outcome. Semantically, there is not a difference between the proposed Summary and the current Histogram instrument.

We recognize that it is difficult to do what I am suggesting today -- configure a specific Histogram instrument with a specific aggregator -- outside of using Views, which have certain strong requirements to use. See #1753 or #2229.

In Lightstep's prototype, I created an API Hint mechanism as a way to configure specific aggregators. I have also created a "minmaxsumcount" alternative aggregator for Histogram instruments, which is a similar use-case to yours: simply a different way to aggregate a distribution. https://github.com/lightstep/otel-launcher-go/tree/main/lightstep/sdk/metric#metric-instrument-hints-api

For the record, one of the difficulties in specifying a Summary aggregator is that the Prometheus-equivalent data type is fixed at Cumualative temporality. What would an SDK do if you requested summary for Delta temporality? (If this is a case that doesn't work, fine with me.)

@asafm
Copy link
Contributor Author

asafm commented Aug 2, 2022

@jack-berg If the specification group included Summary in the protocol, mainly for legacy support, why not complete the support all the way to the API?

The motivation as I stated above is mainly migration. I would like to migrate the entire Apache Pulsar and possible Apache BookKeeper codebase to OpenTelemetry - it's a humongous project, which might be sliced. Since it's a 10-year-old project, of course, it contains Summaries - many of them. The only I can move forward with such migration is to have support for Summary in the Java SDK of OpenTelemetry. I'm pretty sure that those 2 projects are not the only open source projects in Apache that uses Summaries. Having a path for them is a clear win for OT IMO.

Your suggestion is what you called out-of-process, but this will reduce the developer experience of committers of Pulsar. I want to have a Summary object that can use, register and it will participate in the exporting of those metrics both to OT Collector and the experimental Prometheus one. If it's possible for me to do, then I probably didn't understand the recipe.

@jmacd - so you're saying I can develop a Summary object which will be a Histogram but with a specific Aggregate Handle that stores the data and exposes it as quantiles? I couldn't find the code that reads that data into an exporter, so I can better understand how it happens. Would love more hints in that direction.

@jmacd
Copy link
Contributor

jmacd commented Aug 2, 2022

@asafm

I support letting each OTel SDK do what's natural for users as long as stays within the SDK. The SDK will likely have internal interfaces for representing Aggregators and data-transfer interfaces for conveying data points between the reader and the exporter. The SDK will have a Views mechanism for choosing which Aggregator to use for each instrument (it's settings, etc.). What I propose is:

  • SDKs should consider allowing users to bring their own aggregator type and use it by registering the name of the user-supplied type.
  • OTel's API should consider (post-1.0) a mechanism for API hints (Add the metrics API Hint #1753)

With these two features combined, your migration path will replace uses of Summary with OTel Histograms having the necessary hints. For a legacy application, I would expect the hint to apply to the entire instrumentation library, so IMO this should take only a single hint to the instrumentation library, then all your Histograms instruments (in that library) will output Summary data points.

@jmacd
Copy link
Contributor

jmacd commented Aug 2, 2022

One option that OpenTelemetry could explore after its 1.0 milestone, if there's interest, is to specify a Summary aggregation that closely matches the Prometheus definition that could be derived from the exponential histogram. The OTLP Summary data point will require extensions to support delta temporality. When the potential to use Summary data points with other temporality than the present definition has been discussed in the past, there was not enough interest.

@rbailey7210 rbailey7210 added the [label deprecated] triaged-rejected [label deprecated] The issue is triaged and rejected by the OTel community label Aug 5, 2022
@reyang reyang closed this as completed Sep 9, 2022
@asafm
Copy link
Contributor Author

asafm commented Dec 28, 2022

After thoroughly reading the spec, both API and SDK, and reading part of the implementations of Java SDK and Go SDK, I finally have the missing context I needed to reply :)

I would like to clarify the request: I would like the SDK specification, specifically the Aggregation part, to include Summary. Specifically, this part:
image

Coupled with providing hints in the API to specify the aggregation for a histogram, this will allow using a Summary both by an official SDK (specifically the Java one) or by any 3rd party SDK implementing the OTel API. Something of the following form:

meter.histogramBuilder()
	.setDescription()
	.setUnit()
	.ofLongs()
	// New addition of hints API
	.aggregationType(SummaryAgggregationType.builder()
					   	.setQuantiles(0.5, 0.75, 1.0)
					   	.build())
	.build()				   	

@jmacd If SDK would allow free-form aggregators, it probably means I can use that to create a summary aggregator which emits metric data point of type summary. Yet there is a strong downside to this approach - it's not ergonomic - i.e., not user-friendly. Defining a histogram instrument in PersistentTopic.java and then continuing its definition in PulsarService.java in OTel initialization to specify broker.messages.incoming.latency is actually a Summary aggregation with quantiles (0.5, 0.75, 0.9, 0.99) is not developer friendly: the API doesn't lead users to it (you need to guesswork it), it's confusing ("I don't get it, where do I define it's a Summary").
Hints solve the ergonomics of it, but to complete it all way, Summary must be present as Aggregator in the SDK spec; otherwise, hints will be there without Summary, so we're back to square one.

@jack-berg I agree that Histogram fills that gap, so I correct myself: API should not change, only the SDK Spec Aggregation part needs to change - add Summary as an aggregation, which, together with Hints, will solve it end to end.

@asafm asafm changed the title Add support for Summary in API specifications Add Summary to the Aggregation in the SDK specifications Dec 28, 2022
@asafm
Copy link
Contributor Author

asafm commented Dec 28, 2022

@reyang If what I said above makes sense can this issue be re-opened?

@jack-berg
Copy link
Member

Here are some of the problems that I think need to be addressed in a summary aggregation:

  • How does temporality work with summary aggregation? Exponential and explicit bucket histogram's have temporality at the proto level. Summary does not, although temporality seemingly applies count and sum (and arguably quantiles). Would the proto need to be extended to include a temporality field?
  • How are quantiles calculated? Prometheus summary quantiles use a sliding window which I can't find a clear definition for. The prometheus java implementation allows you to configure things like allowable error, max age, and max age buckets. This post explains that the prometheus java implementations use the CKMS algorithm. Do all prometheus implementations calculate the same way? Would an OpenTelemetry summary aggregation be expected to compute quantiles using the same algorithm? Would an OpenTelemetry summary aggregation expose the same configuration options? Are quantiles computed independently of aggregation temporality?

Its not a trivial amount of work, and I question whether it's worth pursuing. Would it not be easier to switch to histograms (explicit or exponential) and a pattern where quantiles are computed by whatever system the metrics are being sent to (either at read time or write time)?

@asafm
Copy link
Contributor Author

asafm commented Jan 6, 2023

@jack-berg, thanks a lot for answering.

First, I agree with you it's a challenging effort and involve work to be done. Since summary has existed for a long time, perhaps we can reuse existing work like reading how different libraries in different languages have done it and deciding on one algorithm/implementation for it.

Second, I agree that it's questionable since some can just switch to histograms, yet here is why I think it's worth the time:

  • OTel aims to be an industry-wide standard. In reality, an idea becomes a standard once a majority uses it. Prometheus format is one example, and in java, SLF4J is another example. Switching over has a cost - development cost. You can argue that OTel is superior to previously used metric libraries, and I agree, but I don't think its added value would balance the effort if it were large. An effort will be large if the change from existing metrics to OTel is big. If you have a Summary today in your code, thus not having one in OTel creates friction to migration since it's an effort to switch over to histogram - it changes your dashboard, alerts, etc. OTel's goal should be to smooth migration, sort of "backward compatible". Having Summary as aggregation makes it so.
    You could say nobody uses Summary. I think, just like Apache Pulsar, many open-source and internal projects have been developed for the last ten years, and back then, it was popular. So I think if we want to make sure the majority of people use OTel, it's worth adding a Summary.
  • Histogram, specifically with explicit buckets, can give you in some cases a large margin of error when computing quantiles on it (I have seen it personally). I don't understand deeply enough Exponential Histograms, so perhaps having automatically computed buckets would lower this error to very small; it might suffice. If not, Summary gives you a much more accurate result, on the expense of not being aggregatable. It's good when you zoom in on specific server.

How does temporality work with summary aggregation? Exponential and explicit bucket histogram's have temporality at the proto level. Summary does not, although temporality seemingly applies count and sum (and arguably quantiles). Would the proto need to be extended to include a temporality field?

In my understanding, a Summary is, by definition, a delta temporality. It gives you the quantiles of the last X minutes. It's basically the same if you had Average aggregation or Rate - it doesn't make sense to compute it from when the process starts. Hence, I don't see any value in having a cumulative summary, thus maybe there isn't a need for temporality field in the proto.

How are quantiles calculated? Prometheus summary quantiles use a sliding window which I can't find a clear definition for. The prometheus java implementation allows you to configure things like allowable error, max age, and max age buckets. This post explains that the prometheus java implementations use the CKMS algorithm. Do all prometheus implementations calculate the same way? Would an OpenTelemetry summary aggregation be expected to compute quantiles using the same algorithm? Would an OpenTelemetry summary aggregation expose the same configuration options? Are quantiles computed independently of aggregation temporality?

All those questions are valid questions that needs answer as part of the effort to introduce it in the specifications. Small note: Apache Pulsar and Apache BookKeeper uses Apache DataSketches libraries which implement a family of streaming algorithms described here - the reason by the way that Prometheus implementation created more memory pressure (GC).

In my opinion, it's worth having it, or even keep the issue opened and maybe mark it as open for volunteers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[label deprecated] triaged-rejected [label deprecated] The issue is triaged and rejected by the OTel community spec:metrics Related to the specification/metrics directory
Projects
None yet
Development

No branches or pull requests

5 participants