Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sdk/metrics: performance impact of exemplars support even when configured as always_off #4243

Open
emdneto opened this issue Oct 31, 2024 · 7 comments · May be fixed by #4253
Open

sdk/metrics: performance impact of exemplars support even when configured as always_off #4243

emdneto opened this issue Oct 31, 2024 · 7 comments · May be fixed by #4253
Labels
metrics sdk Affects the SDK package.

Comments

@emdneto
Copy link
Member

emdneto commented Oct 31, 2024

After d5fb2c4 got merged we had a huge drop on iter/sec in metrics SDK benchmark for all metric types:

image

It seems expected when the Exemplars feature is enabled (the default behavior), but according to the spec, setting the exemplar filter to always_off shouldn't introduce any overhead. Running the benchmark locally, we still have the same overhead even when "disabling" (set to always_off) the Exemplars feature.

AlwaysOff
An ExemplarFilter which makes no measurements eligible for being an Exemplar. Using this ExemplarFilter is as good as disabling Exemplar feature.

Benchmark Results Summary

Metric before exemplars a8aacb0 trace_based (main) always_off (main)
Min (µs) 8.3550 13.1880 13.1440
Max (µs) 23.5780 36.9060 94.2870
Mean (µs) 8.9926 14.4266 16.6942
StdDev (µs) 1.0477 2.6945 10.0739
Median (µs) 8.7910 13.8030 13.9150
IQR (µs) 0.2520 0.5620 0.5935
OPS (Kops/s) 111.2024 69.3162 59.9010
Rounds 690 273 277
Iterations 1 1 1
@emdneto emdneto changed the title sdk/metrics: performance impact of exemplars support even when always_off sdk/metrics: performance impact of exemplars support even when configured as always_off Oct 31, 2024
@emdneto emdneto added sdk Affects the SDK package. metrics labels Oct 31, 2024
@fcollonval
Copy link
Contributor

Info from analysis by @emdneto and discussed between us on Slack:

I ran a profiler to see what's happening in the code and seems the overhead is in the get_current_span and get_current() function calls
I ran the profiler against the main and in a commit before exemplars PR. I can share the results later, but I think the overhead is expected
But maybe we can make exemplars feature opt-in and warns about performance overhead

I definitely can help making the exemplars feature opt-in without performance overhead.
From the spec, https://opentelemetry.io/docs/specs/otel/metrics/sdk/,
If Exemplar sampling is off, the SDK MUST NOT have overhead related to exemplar sampling.
ExemplarFilter.AlwaysOff
An ExemplarFilter which makes no measurements eligible for being an Exemplar. Using this ExemplarFilter is as good as disabling Exemplar feature.
I would propose to ensure that if that filter is set up (possible through env var), no overhead should be added. This will avoid having to add a new flag.
What do you think?

Looking a bit more at the changes done, there is gonna be an issue with the get_current calls to add the context to the measurement.

So I would propose to revert the API changes that add the context to Measurement to request the context only when it is needed - aka when the exemplar code is looking for trace and span.

@emdneto
Copy link
Member Author

emdneto commented Nov 6, 2024

The problem also is if we have several different aggregations (for example, when generating metrics with dynamic attributes), it will create several reservoirs with several ExemplarsBuckets. On the other hand, each aggregation is created with the reference for theReservoirFactory. Since the benchmark test run with dynamic attributes, it explains the huge drop in metrics-sdk benchmark.

Image of a benchmark test run using EXEMPLAR_FILTER=always_off
image

TLDR;
The performance is not affected if we have static attributes on the measurement. However, if we have dynamic attributes (which implies creating new aggregations for each combination of attributes), we'll have the same degraded behavior as the benchmark test because of the instantiation of several ExemplarsReservoirs.

So, at least, I expect if the Exemplar's feature is disabled (always_off), there's no need to instantiate a FixedSizeReservoir every time a new aggregation is created. So maybe we can have a NoOpReservoir instead when the exemplar filter is always_off.

@aabmass
Copy link
Member

aabmass commented Nov 7, 2024

@emdneto when you say dynamic attributes, you mean the benchmark is generating changing attributes for each iteration vs using the same attribute set each time?

So, at least, I expect if the Exemplar's feature is disabled (always_off), there's no need to instantiate a FixedSizeReservoir every time a new aggregation is created. So maybe we can have a NoOpReservoir instead when the exemplar filter is always_off.

This seems like a reasonable optimization. Not sure if it's better, but we could make the FixedSizeReservoir behave more lazily. That profile points to the constructor which eagerly generates it's reservoir

self._reservoir_storage: List[ExemplarBucket] = [
ExemplarBucket() for _ in range(self._size)
]

@xrmx
Copy link
Contributor

xrmx commented Nov 8, 2024

I would also benchmark this again after #4251 because it fixed always_off to work properly.

@emdneto
Copy link
Member Author

emdneto commented Nov 8, 2024

@emdneto when you say dynamic attributes, you mean the benchmark is generating changing attributes for each iteration vs using the same attribute set each time?

So, at least, I expect if the Exemplar's feature is disabled (always_off), there's no need to instantiate a FixedSizeReservoir every time a new aggregation is created. So maybe we can have a NoOpReservoir instead when the exemplar filter is always_off.

This seems like a reasonable optimization. Not sure if it's better, but we could make the FixedSizeReservoir behave more lazily. That profile points to the constructor which eagerly generates it's reservoir

self._reservoir_storage: List[ExemplarBucket] = [
ExemplarBucket() for _ in range(self._size)
]

  1. Exactly! Benchmark tests use different attributes (which are fine and represent the reality of many systems with high cardinality metrics).

@xrmx
Copy link
Contributor

xrmx commented Nov 8, 2024

With this I get a bit less than 2x loops in always_off and trace_based running .tox/benchmark-opentelemetry-sdk/bin/pytest --profile opentelemetry-sdk/benchmarks/metrics/ -k 'test_histogram_record_1000[7]':

diff --git a/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/exemplar/exemplar_reservoir.py b/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/exemplar/exemplar_reservoir.py
index c8fa7f14..12165821 100644
--- a/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/exemplar/exemplar_reservoir.py
+++ b/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/exemplar/exemplar_reservoir.py
@@ -155,9 +155,7 @@ class FixedSizeExemplarReservoirABC(ExemplarReservoir):
     def __init__(self, size: int, **kwargs) -> None:
         super().__init__(**kwargs)
         self._size: int = size
-        self._reservoir_storage: List[ExemplarBucket] = [
-            ExemplarBucket() for _ in range(self._size)
-        ]
+        self._reservoir_storage: Mapping[int, ExemplarBucket] = {}
 
     def collect(self, point_attributes: Attributes) -> List[Exemplar]:
         """Returns accumulated Exemplars and also resets the reservoir for the next
@@ -173,10 +171,7 @@ class FixedSizeExemplarReservoirABC(ExemplarReservoir):
         """
         exemplars = filter(
             lambda e: e is not None,
-            map(
-                lambda bucket: bucket.collect(point_attributes),
-                self._reservoir_storage,
-            ),
+            (bucket.collect(point_attributes) for _, bucket in sorted(self._reservoir_storage.items()))
         )
         self._reset()
         return [*exemplars]
@@ -201,6 +196,7 @@ class FixedSizeExemplarReservoirABC(ExemplarReservoir):
                 value, time_unix_nano, attributes, context
             )
 
+            self._reservoir_storage.setdefault(index, ExemplarBucket())
             self._reservoir_storage[index].offer(
                 value, time_unix_nano, attributes, context
             )
@@ -296,6 +292,7 @@ class AlignedHistogramBucketExemplarReservoir(FixedSizeExemplarReservoirABC):
         index = self._find_bucket_index(
             value, time_unix_nano, attributes, context
         )
+        self._reservoir_storage.setdefault(index, ExemplarBucket())
         self._reservoir_storage[index].offer(
             value, time_unix_nano, attributes, context
         )

@xrmx
Copy link
Contributor

xrmx commented Nov 8, 2024

@fcollonval please take a look at #4260

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
metrics sdk Affects the SDK package.
Projects
None yet
4 participants