sdk/metrics: performance impact of exemplars support even when configured as `always_off` #4243

emdneto · 2024-10-31T16:42:49Z

After d5fb2c4 got merged we had a huge drop on iter/sec in metrics SDK benchmark for all metric types:

It seems expected when the Exemplars feature is enabled (the default behavior), but according to the spec, setting the exemplar filter to always_off shouldn't introduce any overhead. Running the benchmark locally, we still have the same overhead even when "disabling" (set to always_off) the Exemplars feature.

AlwaysOff
An ExemplarFilter which makes no measurements eligible for being an Exemplar. Using this ExemplarFilter is as good as disabling Exemplar feature.

Benchmark Results Summary

Metric	before exemplars `a8aacb0`	trace_based (main)	always_off (main)
Min (µs)	8.3550	13.1880	13.1440
Max (µs)	23.5780	36.9060	94.2870
Mean (µs)	8.9926	14.4266	16.6942
StdDev (µs)	1.0477	2.6945	10.0739
Median (µs)	8.7910	13.8030	13.9150
IQR (µs)	0.2520	0.5620	0.5935
OPS (Kops/s)	111.2024	69.3162	59.9010
Rounds	690	273	277
Iterations	1	1	1

The text was updated successfully, but these errors were encountered:

fcollonval · 2024-11-06T09:04:51Z

Info from analysis by @emdneto and discussed between us on Slack:

I ran a profiler to see what's happening in the code and seems the overhead is in the get_current_span and get_current() function calls
I ran the profiler against the main and in a commit before exemplars PR. I can share the results later, but I think the overhead is expected
But maybe we can make exemplars feature opt-in and warns about performance overhead

I definitely can help making the exemplars feature opt-in without performance overhead.
From the spec, https://opentelemetry.io/docs/specs/otel/metrics/sdk/,
If Exemplar sampling is off, the SDK MUST NOT have overhead related to exemplar sampling.
ExemplarFilter.AlwaysOff
An ExemplarFilter which makes no measurements eligible for being an Exemplar. Using this ExemplarFilter is as good as disabling Exemplar feature.
I would propose to ensure that if that filter is set up (possible through env var), no overhead should be added. This will avoid having to add a new flag.
What do you think?

Looking a bit more at the changes done, there is gonna be an issue with the get_current calls to add the context to the measurement.

So I would propose to revert the API changes that add the context to Measurement to request the context only when it is needed - aka when the exemplar code is looking for trace and span.

emdneto · 2024-11-06T14:51:21Z

The problem also is if we have several different aggregations (for example, when generating metrics with dynamic attributes), it will create several reservoirs with several ExemplarsBuckets. On the other hand, each aggregation is created with the reference for theReservoirFactory. Since the benchmark test run with dynamic attributes, it explains the huge drop in metrics-sdk benchmark.

Image of a benchmark test run using EXEMPLAR_FILTER=always_off

TLDR;
The performance is not affected if we have static attributes on the measurement. However, if we have dynamic attributes (which implies creating new aggregations for each combination of attributes), we'll have the same degraded behavior as the benchmark test because of the instantiation of several ExemplarsReservoirs.

So, at least, I expect if the Exemplar's feature is disabled (always_off), there's no need to instantiate a FixedSizeReservoir every time a new aggregation is created. So maybe we can have a NoOpReservoir instead when the exemplar filter is always_off.

aabmass · 2024-11-07T20:50:58Z

@emdneto when you say dynamic attributes, you mean the benchmark is generating changing attributes for each iteration vs using the same attribute set each time?

So, at least, I expect if the Exemplar's feature is disabled (always_off), there's no need to instantiate a FixedSizeReservoir every time a new aggregation is created. So maybe we can have a NoOpReservoir instead when the exemplar filter is always_off.

This seems like a reasonable optimization. Not sure if it's better, but we could make the FixedSizeReservoir behave more lazily. That profile points to the constructor which eagerly generates it's reservoir

opentelemetry-python/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/exemplar/exemplar_reservoir.py

Lines 158 to 160 in 43804f3

    
           self._reservoir_storage: List[ExemplarBucket] = [ 
        
               ExemplarBucket() for _ in range(self._size) 
        
           ]

xrmx · 2024-11-08T11:36:20Z

I would also benchmark this again after #4251 because it fixed always_off to work properly.

emdneto · 2024-11-08T14:40:55Z

@emdneto when you say dynamic attributes, you mean the benchmark is generating changing attributes for each iteration vs using the same attribute set each time?

So, at least, I expect if the Exemplar's feature is disabled (always_off), there's no need to instantiate a FixedSizeReservoir every time a new aggregation is created. So maybe we can have a NoOpReservoir instead when the exemplar filter is always_off.

This seems like a reasonable optimization. Not sure if it's better, but we could make the FixedSizeReservoir behave more lazily. That profile points to the constructor which eagerly generates it's reservoir

opentelemetry-python/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/exemplar/exemplar_reservoir.py

Lines 158 to 160 in 43804f3

self._reservoir_storage: List[ExemplarBucket] = [

ExemplarBucket() for _ in range(self._size)

]

Exactly! Benchmark tests use different attributes (which are fine and represent the reality of many systems with high cardinality metrics).

xrmx · 2024-11-08T16:04:35Z

With this I get a bit less than 2x loops in always_off and trace_based running .tox/benchmark-opentelemetry-sdk/bin/pytest --profile opentelemetry-sdk/benchmarks/metrics/ -k 'test_histogram_record_1000[7]':

diff --git a/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/exemplar/exemplar_reservoir.py b/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/exemplar/exemplar_reservoir.py
index c8fa7f14..12165821 100644
--- a/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/exemplar/exemplar_reservoir.py
+++ b/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/exemplar/exemplar_reservoir.py
@@ -155,9 +155,7 @@ class FixedSizeExemplarReservoirABC(ExemplarReservoir):
     def __init__(self, size: int, **kwargs) -> None:
         super().__init__(**kwargs)
         self._size: int = size
-        self._reservoir_storage: List[ExemplarBucket] = [
-            ExemplarBucket() for _ in range(self._size)
-        ]
+        self._reservoir_storage: Mapping[int, ExemplarBucket] = {}
 
     def collect(self, point_attributes: Attributes) -> List[Exemplar]:
         """Returns accumulated Exemplars and also resets the reservoir for the next
@@ -173,10 +171,7 @@ class FixedSizeExemplarReservoirABC(ExemplarReservoir):
         """
         exemplars = filter(
             lambda e: e is not None,
-            map(
-                lambda bucket: bucket.collect(point_attributes),
-                self._reservoir_storage,
-            ),
+            (bucket.collect(point_attributes) for _, bucket in sorted(self._reservoir_storage.items()))
         )
         self._reset()
         return [*exemplars]
@@ -201,6 +196,7 @@ class FixedSizeExemplarReservoirABC(ExemplarReservoir):
                 value, time_unix_nano, attributes, context
             )
 
+            self._reservoir_storage.setdefault(index, ExemplarBucket())
             self._reservoir_storage[index].offer(
                 value, time_unix_nano, attributes, context
             )
@@ -296,6 +292,7 @@ class AlignedHistogramBucketExemplarReservoir(FixedSizeExemplarReservoirABC):
         index = self._find_bucket_index(
             value, time_unix_nano, attributes, context
         )
+        self._reservoir_storage.setdefault(index, ExemplarBucket())
         self._reservoir_storage[index].offer(
             value, time_unix_nano, attributes, context
         )

xrmx · 2024-11-08T16:49:18Z

@fcollonval please take a look at #4260

emdneto changed the title ~~sdk/metrics: performance impact of exemplars support even when always_off~~ sdk/metrics: performance impact of exemplars support even when configured as always_off Oct 31, 2024

emdneto added sdk Affects the SDK package. metrics labels Oct 31, 2024

fcollonval linked a pull request Nov 6, 2024 that will close this issue

Improve performance: Don't request current context when creating a measurement #4253

Draft

4 tasks

xrmx mentioned this issue Nov 8, 2024

opentelemetry-sdk: speed up exemplars a bit #4260

Open

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sdk/metrics: performance impact of exemplars support even when configured as `always_off` #4243

sdk/metrics: performance impact of exemplars support even when configured as `always_off` #4243

emdneto commented Oct 31, 2024 •

edited

Loading

fcollonval commented Nov 6, 2024

emdneto commented Nov 6, 2024 •

edited

Loading

aabmass commented Nov 7, 2024 •

edited

Loading

xrmx commented Nov 8, 2024

emdneto commented Nov 8, 2024

xrmx commented Nov 8, 2024 •

edited

Loading

xrmx commented Nov 8, 2024

sdk/metrics: performance impact of exemplars support even when configured as always_off #4243

sdk/metrics: performance impact of exemplars support even when configured as always_off #4243

Comments

emdneto commented Oct 31, 2024 • edited Loading

Benchmark Results Summary

fcollonval commented Nov 6, 2024

emdneto commented Nov 6, 2024 • edited Loading

aabmass commented Nov 7, 2024 • edited Loading

xrmx commented Nov 8, 2024

emdneto commented Nov 8, 2024

xrmx commented Nov 8, 2024 • edited Loading

xrmx commented Nov 8, 2024

sdk/metrics: performance impact of exemplars support even when configured as `always_off` #4243

sdk/metrics: performance impact of exemplars support even when configured as `always_off` #4243

emdneto commented Oct 31, 2024 •

edited

Loading

emdneto commented Nov 6, 2024 •

edited

Loading

aabmass commented Nov 7, 2024 •

edited

Loading

xrmx commented Nov 8, 2024 •

edited

Loading