
Improve cost of creating AttributeSets #1379

Merged: 16 commits, Nov 22, 2023

Conversation

@KallDrexx (Contributor) commented Nov 16, 2023

Changes

When attempting to call Counter<T>::add() or UpDownCounter<T>::add() on a frequent basis (e.g. per network packet), a significant amount of CPU time was being used by AttributeSet::from() calls.

Replacing the AttributeSet's inner type from a BTreeSet to a Vec (plus sort() and de-duplication) improves metric throughput, presumably due to more efficient handling of small collections. Likewise, pre-caching the hash for the AttributeSet made lookups much faster.

This PR also changes the metrics test to have attributes out of order, to ensure that throughput considers sorting cost.
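
As a rough, self-contained illustration of the shape described above (simplified, string-only types, not the exact code in this PR; the last-one-wins de-dup policy shown here is just one reasonable choice, and the right policy is discussed further down this thread):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Simplified stand-in for opentelemetry::KeyValue (string values only).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct KeyValue {
    key: String,
    value: String,
}

#[derive(Debug, PartialEq, Eq)]
struct AttributeSet {
    values: Vec<KeyValue>, // sorted + de-duplicated Vec instead of a BTreeSet
    hash: u64,             // computed once here, replayed on every lookup
}

impl AttributeSet {
    fn from(values: &[KeyValue]) -> Self {
        let mut values = values.to_vec();
        // One sort at construction time instead of per-insert tree balancing.
        values.sort_by(|a, b| a.key.cmp(&b.key)); // stable: ties keep supplied order
        // Drop duplicate keys; `dedup_by` removes the later element, so we
        // first move its value into the surviving one (last supplied wins).
        values.dedup_by(|later, earlier| {
            let dup = later.key == earlier.key;
            if dup {
                std::mem::swap(&mut earlier.value, &mut later.value);
            }
            dup
        });
        let mut hasher = DefaultHasher::new();
        values.hash(&mut hasher);
        Self { values, hash: hasher.finish() }
    }
}

impl Hash for AttributeSet {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.hash); // cheap replay of the cached hash
    }
}

fn main() {
    let kv = |k: &str, v: &str| KeyValue { key: k.into(), value: v.into() };
    let a = AttributeSet::from(&[kv("name", "apple"), kv("color", "red")]);
    let b = AttributeSet::from(&[kv("color", "red"), kv("name", "apple")]);
    assert_eq!(a, b); // attribute order at the call site no longer matters
}
```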

Stress tests

main branch

Number of threads: 8
Throughput: 4,236,400 iterations/sec
Throughput: 4,253,800 iterations/sec
Throughput: 4,261,400 iterations/sec
Throughput: 4,308,400 iterations/sec
Throughput: 4,389,600 iterations/sec
Throughput: 4,440,800 iterations/sec
Throughput: 4,474,400 iterations/sec
Throughput: 4,460,200 iterations/sec

This PR

Number of threads: 8
Throughput: 5,490,600 iterations/sec
Throughput: 5,496,800 iterations/sec
Throughput: 5,891,800 iterations/sec
Throughput: 5,909,600 iterations/sec
Throughput: 5,830,800 iterations/sec

This ends up being a 31% throughput improvement on my machine.

Benchmark result

Counter_Add_Sorted      time:   [698.63 ns 703.35 ns 708.73 ns]
                        change: [-2.6023% -0.8437% +0.7414%] (p = 0.36 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

Counter_Add_Unsorted    time:   [696.41 ns 703.65 ns 712.87 ns]
                        change: [-4.6853% -2.7787% -1.0534%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

linux-foundation-easycla bot commented Nov 16, 2023: CLA signed. The committers listed above are authorized under a signed CLA.

@KallDrexx (Contributor Author) commented:

@cijothomas When you get a chance, what is the best way to fix the test failure this PR causes?

Specifically, the prometheus sanitation test uses duplicate attributes (A.B) and relies on the previous de-duplication logic, which would only result in Q. However, my PR removes the de-duplication logic, and thus the final result has the two values concatenated.

Based on previous conversations, the C++ and .NET libraries do not perform de-duplication, and as far as I'm aware the OpenTelemetry spec does not say they should be de-duplicated, which means this behavior might surprise users.

Should I modify the test to expect them to be concatenated?

@cijothomas (Member) commented:

@cijothomas When you get a chance, what is the best way to fix the test failure this PR causes?

Specifically, the prometheus sanitation test uses duplicate attributes (A.B) and relies on the previous de-duplication logic, which would only result in Q. However, my PR removes the de-duplication logic, and thus the final result has the two values concatenated.

Based on previous conversations, the C++ and .NET libraries do not perform de-duplication, and as far as I'm aware the OpenTelemetry spec does not say they should be de-duplicated, which means this behavior might surprise users.

Should I modify the test to expect them to be concatenated?

prometheus won't be able to scrape successfully if there is duplication, so maybe this is something to be handled at the prometheus exporter level. @jtescher Do you have suggestions to unblock? Could we just modify the prometheus test to prevent the issue, and document that the SDK/Prometheus exporter does not de-dup and that it is the user's responsibility?

@KallDrexx KallDrexx marked this pull request as ready for review November 17, 2023 18:37
@KallDrexx KallDrexx requested a review from a team November 17, 2023 18:37
@jtescher (Member) commented:

Seems like there are a few places to investigate here. If the instrument add calls are made with a fixed set of attributes, we could look at changing the methods to accept attribute sets rather than slices that have to be converted, which may sidestep this issue for some people. If this is more about supporting a large range of different attribute sets for instruments, we could look at fixing this in exporters, or at adding a new_unchecked-style API where users have to ensure there are no dupes themselves.
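
As a purely hypothetical illustration of the first idea (these signatures and names are made up, not existing opentelemetry-rust APIs), the point is to let the caller build the attribute set once and reuse it, paying the conversion cost a single time rather than per add call:

```rust
// Hypothetical shapes for illustration only; neither signature exists in opentelemetry-rust today.
struct AttributeSet(Vec<(String, String)>); // assume sorted/validated at construction

struct Counter;

impl Counter {
    // Current shape: every call hands over a slice that must be converted.
    fn add(&self, _value: u64, _attributes: &[(String, String)]) {}

    // Suggested alternative: accept a pre-built set, skipping the per-call conversion.
    fn add_with_set(&self, _value: u64, attributes: &AttributeSet) {
        let _ = &attributes.0; // placeholder body; a real SDK would record the measurement
    }
}

fn main() {
    let counter = Counter;
    let reusable = AttributeSet(vec![("endpoint".to_string(), "/packets".to_string())]);
    for bytes in [64u64, 512, 1500] {
        counter.add_with_set(bytes, &reusable); // fixed attributes, built once
    }
    counter.add(1, &[("endpoint".to_string(), "/other".to_string())]); // ad-hoc path still available
}
```

Whether that helps depends on how stable the attribute combinations are, which is the distinction drawn in the next comment.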

@KallDrexx (Contributor Author) commented:

prometheus won't be able to scrape successfully if there is duplication, so maybe this is something to be handled at the prometheus exporter level.

Are we sure about this? The prometheus test seems pretty explicit that it will concatenate values if the keys are duplicated due to sanitation. That is tested and still works after my change. So it seems like prometheus already supports some level of duplication, this test is only making the explicit decision that duplicate keys due to sanitation are allowed but duplicate exact keys are not allowed.

@KallDrexx (Contributor Author) commented:

Seems like there are a few places to investigate here. If the instrument add calls are made with a fixed set of attributes, we could look at changing the methods to accept attribute sets rather than slices that have to be converted, which may sidestep this issue for some people.

That only addresses performance concerns for users that have a stable set of attributes they use a lot. This PR significantly improves performance for users who do not have a predictable set of attribute key/value combinations.

@shaun-cox (Contributor) left a comment

Excellent improvement! Thanks for the contribution. Assuming other approvers/maintainers are okay with the behavior change, I wholeheartedly support from a performance perspective.

stress/src/metrics.rs (review thread, outdated)
@cijothomas (Member) commented Nov 17, 2023

prometheus won't be able to scrape successfully if there is duplication, so maybe this is something to be handled at the prometheus exporter level.

Are we sure about this? The prometheus test seems pretty explicit that it will concatenate values if the keys are duplicated due to sanitation. That is tested and still works after my change. So it seems like prometheus already supports some level of duplication, this test is only making the explicit decision that duplicate keys due to sanitation are allowed but duplicate exact keys are not allowed.

I tested OTel .NET with actual Prometheus to see how Prometheus would react if the de-dup is not done. Below are my findings.

The following is output that Prometheus is happily scraping successfully:

# TYPE myCounter_total counter
# HELP myCounter_total A counter for demonstration purpose.
myCounter_total{color="red",name="apple"} 2999.700000000016 1700256532355
myCounter_total{color="yellow",name="lemon"} 30269.70000000016 1700256532355

# EOF

This was generated using the code below:

Counter.Add(9.9, new("name", "apple"), new("color", "red"));
Counter.Add(99.9, new("name", "lemon"), new("color", "yellow"));

Now the following is something Prometheus fails to scrape, with the error message "label name "name" is not unique: invalid sample":

# TYPE myCounter_total counter
# HELP myCounter_total A counter for demonstration purpose.
myCounter_total{color="red",name="orange",name="apple"} 11375.099999999758 1700256656251
myCounter_total{color="yellow",name="lemon"} 114785.09999999781 1700256656251

# EOF

This was generated using

Counter.Add(9.9, new("name", "apple"), new("name", "orange"), new("color", "red")); // no dedup done in sdk..
Counter.Add(99.9, new("name", "lemon"), new("color", "yellow"));

In the case of .NET, neither the SDK nor the Prometheus exporter is de-duping, so the Prometheus scrape is failing.
For OTel Rust, we need to decide if we are okay with behavior similar to OTel .NET, or whether we should do the de-dup in the PrometheusExporter itself.

My personal vote is to not do dedup in Prometheus now, and to get back to this based on feedback. If there are users affected by this, we should fix the prom exporter.

Main point being - let's remove the dedup from the SDK and work towards the goals!

@cijothomas (Member) commented:

# TYPE myCounter_total counter
# HELP myCounter_total A counter for demonstration purpose.
myCounter_total{color="red",name="orange",name="apple"} 11375.099999999758 1700256656251
myCounter_total{color="yellow",name="lemon"} 114785.09999999781 1700256656251

can prometheus scrape this correctly?

Duplicate Label Names: In the first metric, myCounter_total{color="red",name="orange",name="apple"}, you have used the label name name twice. Prometheus expects each label name within a metric to be unique. Having duplicate label names is not valid in Prometheus and would likely cause an error during scraping.

Should have just asked AI 🤖

@hdost (Contributor) commented Nov 18, 2023

In the case of .NET, neither the SDK nor the Prometheus exporter is de-duping, so the Prometheus scrape is failing.
For OTel Rust, we need to decide if we are okay with behavior similar to OTel .NET, or whether we should do the de-dup in the PrometheusExporter itself.

.NET has a bad Prometheus exporter implementation? I wouldn't think we should replicate that 🤕. I agree we want to get good performance, but we should also ensure proper exporter output. This could be offloaded to scrape time.

@KallDrexx (Contributor Author) commented Nov 18, 2023

It's not clear to me why deduping should be opentelemetry's concern rather than a concern of the calling application.

What happens when deduping occurs? Should the last value be kept? The first? And what do "last" and "first" even mean in different scenarios, since all attributes are passed in at the same time?

If the calling application passes in the duplicate attributes in a different order in different scenarios, then deduplication logic will split them into different time series, causing hard-to-debug, incorrect metrics.

Only the consuming application knows how deduping should accurately be done, imo.

@hdost (Contributor) commented Nov 18, 2023

While we don't have to follow the limits, there are clear guidelines for how to deal with attributes: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/README.md#attribute-limits
If we want to offload the work of deduplication to the exporter, then we will need to preserve the order in which they were added.

For metrics the identity is very important to how values are aggregated.

@hdost (Contributor) commented Nov 18, 2023

Also, I do want to say @KallDrexx thank you for looking into this, it's very much appreciated. I apologize if I came off a bit strong to start.

@hdost (Contributor) commented Nov 18, 2023

In a curiosity vein, do you have a skeleton you could provide for your specific use case? Would it be possible to have a cached set of attributes, or is it entirely dynamic for your network counter?

@KallDrexx (Contributor Author) commented:

Those guidelines are far from clear in this scenario as far as I can tell. Not only do they state that they aren't really applicable to metrics, they also dictate how truncation should happen when there is a finite number of allowed attributes, which isn't the case here.

For metrics the identity is very important to how values are aggregated.

Which is 100% why I think the library doing arbitrary deduplication is bad.

If the consumer calls counter.add(5, {color: red, shape: circle}), what are they semantically trying to communicate? Based on my experience in past observability projects (not OpenTelemetry specific), they were looking to increment the counter for the intersection of all attributes specified.

So as a concrete example calling

counter.add(5, {color: red, shape: square});
counter.add(5, {color: red, shape: circle});

In my metrics UI I should see 10 for color: red, 5 for shape: square, 5 for shape: circle, 5 for shape: circle, color: red and 0 for shape: circle, color: blue. The latter one is especially important because I can clearly tell that all the circles that were incremented were red and none were blue.

So what happens if the user calls counter.add(5, {color: red, color: blue})? Semantically this means that they only want 5 added to the metrics where you are searching for both red and blue, which is a totally different bucket than red and blue by themselves.

So truncating now means that counter.add(5, {color: red, color: blue}) is equivalent to counter.add(5, {color: red}) which has a very different semantic meaning imo.

This is not that crazy to me, because that's exactly how the prometheus exporter test works today. If you look at the test, you see that it assumes that if C.D and C/D are provided, they should get concatenated into a single C_D with the final attribute value being Y;Z. While I don't have the back story on that test (I didn't write it and am not a Prometheus expert), my assumption is that this will get aggregated by Prometheus as a single Y;Z value that shows up in filters, which is the only meaningful way to filter on values that are an intersection of two types of the same dimension.

This leads me to believe that the current prometheus code is correct. If you call counter.add(5, {color: red, color: blue}), then the current prometheus exporter will export color:red;blue, which, while it will bloat the UI, is the only way to make the metrics correctly identifiable.

That's why I believe that the caller of the SDK should do any de-duplication that involves complete removal of attributes, because they are the only ones who know the valid way to do so.

So my next question is how do other opentelemetry SDKs handle this?

  • Python's add call requires a dictionary to be passed in, explicitly forcing the consumer of the SDK to do any de-duplication themselves.
  • Java's library does de-dupe attributes, but does last-one-wins, which is different from Rust's code today and from the guidelines posted above.
  • C++ does not seem to make any attempt to filter duplicate keys in their iterables (the interface doesn't specify it and the code I found doesn't do it).
  • I'm not totally sure what's going on in Go, but from the browsing I did it doesn't appear to be doing any filtering.
  • JavaScript's library expects a JavaScript object to be passed in for attributes, which means it's expecting the caller to have no duplicates already when passed in (or to have handled duplicate keys themselves).

So every language library seems to have different considerations, and no one follows that guideline you posted specifically. Maybe that's a bad thing and Rust's current implementation is doing things right while everyone else is wrong, but is it okay to be the only one doing things "right" if it causes a performance hit?

@KallDrexx (Contributor Author) commented:

In a curiosity vein, do you have a skeleton you could provide for your specific use case? Would it be possible to have a cached set of attributes, or is it entirely dynamic for your network counter?

Our application is working with metrics for network I/O on a per-packet basis, which is why we encountered the AttributeSet::from() costs. I do plan on looking at adding a bounded instrument API so we don't have to bear the cost of creating attribute sets as often, but we will still have to create new attribute sets regularly at some points, so I wanted to ensure that that code path was still fast.

@cijothomas (Member) commented:

In the case of .NET, neither the SDK nor the Prometheus exporter is de-duping, so the Prometheus scrape is failing.
For OTel Rust, we need to decide if we are okay with behavior similar to OTel .NET, or whether we should do the de-dup in the PrometheusExporter itself.

.NET has a bad Prometheus exporter implementation? I wouldn't think we should replicate that 🤕. I agree we want to get good performance, but we should also ensure proper exporter output. This could be offloaded to scrape time.

You are right! It could be treated as a bug in OTel .NET's Prometheus exporter. OTel .NET's Prometheus exporter is an experimental component, and there is a chance it'll never get a stable release at all (in favor of native OTLP support in Prometheus).

I also tried out how Prometheus's native OTLP ingestion deals with duplication, and was surprised by the result! I sent it MyFruitCounter.Add(1, new("name", "apple"), new("name", "orange"), new("color", "red")); via OTLP from OTel .NET, where we know neither the SDK nor the OTLP exporter is de-duping, and Prometheus stored it as MyFruitCounter_total{color="red", job="unknown_service:getting-started-prometheus-grafana", name="orange;apple"}

@hdost Any ideas on how OTel Rust's Prometheus exporter should handle duplicate attributes? Should we send name="orange" with a "last-one-wins" policy, or should we do name="orange;apple", just like Prometheus itself is doing?

@hdost (Contributor) commented Nov 18, 2023

For metrics the identity is very important to how values are aggregated.

Which is 100% why I think the library doing arbitrary deduplication is bad.

Well, I would agree that arbitrary dedup is bad, and I would say what I said before: we could have the exporter (the client, from my perspective) expose settings which control how dedup is handled. It also wouldn't put the performance hit on the writer but on the subscriber.

That's what I was trying to say here:

If we want to offload the work of deduplication to the exporter, then we will need to preserve the order in which they were added.

Whatever we do, Prometheus specifically requires uniqueness of labels, so whether to de-dup is not the question, but how. And there are Processors and Views which can be utilized to determine how this data is processed; it's the SDK which enables that. And to provide flexibility to users we could certainly expand it.

That's why I believe that the caller of the SDK should do any de-duplication that involves complete removal of attributes, because they are the only one who know the valid way to do so.

So every language library seems to have different considerations and no one follows that guideline you posted specifically. Maybe that's a bad thing and the rust's current implementation is doing things right and everyone else is wrong, but is it ok to be the only one doing things "right" if it causes a performance hit?

I'm not going to die on the hill of right vs. performance, but I think this comes from me not explaining my position well enough, as that's not what I was arguing.

I would also say whatever "right" is should be consistent between implementations, otherwise what's the point of having a standard, but that's going to be decided here. If we decide to go against spec then that's why we have #1297

@KallDrexx (Contributor Author) commented:

I will also mention that if we really want de-duplication logic as part of AttributeSet::from(), then I can look at adding that in. My suspicion is that the majority of the performance gain is from switching from BTreeSet to Vec. I'd rather not add the cost back in if it's not needed.

I'm not necessarily opposed to doing deduplication of keys on the exporting side (which I think may be what you are aiming for). However to me it seems like:

  • The OpenTelemetry spec does not specify this behavior, as far as I've been able to find.
  • Every implementation that I've been able to find seems to differ:
    • Some APIs filter duplicate attribute keys on creation of the attribute set (Rust right now is first-wins, Java is last-wins)
    • Some APIs have API boundaries that make it impossible to submit duplicate keys in the first place (Python, JavaScript)
    • Some APIs seem to do no filtering at attribute set time (C++, and I think Go)

I haven't looked at the exporter code in these projects yet but it seems like there's no consistency between implementations currently. So I'm not sure it's actually a deviation from the spec.

@KallDrexx (Contributor Author) commented:

If we want to offload the work of deduplication to the exporter, then we will need to preserve the order in which they were added.

I guess "preserving the order in which they are added" is an interesting concept. For example

counter.add(1, {color: red, color: blue});
counter.add(1, {color: blue, color: red});

This is actually technically an edge case in my PR as well. But should these be in the same time series or different time series? If the same, should order be preserved at export time?
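
To make the edge case concrete, here is a tiny sketch with plain string tuples (not the SDK types), assuming a stable sort by key and no de-duplication, as the PR stood at this point:

```rust
fn normalized(attrs: &[(&str, &str)]) -> Vec<(String, String)> {
    let mut v: Vec<(String, String)> = attrs
        .iter()
        .map(|&(k, val)| (k.to_string(), val.to_string()))
        .collect();
    v.sort_by(|a, b| a.0.cmp(&b.0)); // stable: duplicate keys keep their supplied order
    v
}

fn main() {
    let a = normalized(&[("color", "red"), ("color", "blue")]);
    let b = normalized(&[("color", "blue"), ("color", "red")]);
    // The duplicate key survives in its supplied order, so the two calls
    // produce different attribute sets and therefore different time series.
    assert_ne!(a, b);
    println!("{a:?} vs {b:?}");
}
```

A last-one-wins de-dup on top of the sort would collapse each call to a single color entry, but the surviving value would still differ (blue vs. red), so the answer stays order-dependent either way.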

@KallDrexx (Contributor Author) commented Nov 20, 2023

This sounds very promising, do you want to make this one a specific separate PR for comparison?

I went ahead and updated this PR, because while precomputing the hash is new, I still did it in the vein of storing the attributes in a Vec<HashedKeyValue> (though I probably don't need HashKeyValue instead of just KeyValue since it's precomputed).

The Vec<T> is still going to be faster than a BTreeSet, as the BTreeSet does more work to keep itself sorted on every insert instead of sorting once at the end (among other Vec optimizations for small lists).

So the new changes, which include last-wins deduplication logic, now pass all tests, as can be seen on CI. I still want to do some manual verification since we don't have good test coverage over attributes.

The current stress test values are:

mshapiro@matt-ms-hpz:~/code/opentelemetry-rust/stress$ cargo run --release --bin metrics
   Compiling opentelemetry_sdk v0.21.1 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-sdk)
   Compiling opentelemetry-stdout v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-stdout)
   Compiling opentelemetry-appender-tracing v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-appender-tracing)
   Compiling stress v0.1.0 (/home/mshapiro/code/opentelemetry-rust/stress)
    Finished release [optimized] target(s) in 6.13s
     Running `/home/mshapiro/code/opentelemetry-rust/target/release/metrics`
Number of threads: 6
Throughput: 6,290,200 iterations/sec
Throughput: 6,562,400 iterations/sec
Throughput: 6,672,400 iterations/sec
Throughput: 6,636,600 iterations/sec
Throughput: 6,373,800 iterations/sec
Throughput: 6,460,800 iterations/sec
^C
mshapiro@matt-ms-hpz:~/code/opentelemetry-rust/stress$ cargo run --release --bin metrics
    Finished release [optimized] target(s) in 0.08s
     Running `/home/mshapiro/code/opentelemetry-rust/target/release/metrics`
Number of threads: 6
Throughput: 6,304,000 iterations/sec
Throughput: 5,937,600 iterations/sec
Throughput: 5,824,600 iterations/sec
Throughput: 6,308,600 iterations/sec
Throughput: 6,110,200 iterations/sec
^C
mshapiro@matt-ms-hpz:~/code/opentelemetry-rust/stress$ cargo run --release --bin metrics
   Compiling opentelemetry_sdk v0.21.1 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-sdk)
   Compiling opentelemetry-stdout v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-stdout)
   Compiling opentelemetry-appender-tracing v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-appender-tracing)
   Compiling stress v0.1.0 (/home/mshapiro/code/opentelemetry-rust/stress)
    Finished release [optimized] target(s) in 9.07s
     Running `/home/mshapiro/code/opentelemetry-rust/target/release/metrics`
Number of threads: 6
Throughput: 1,248,800 iterations/sec
Throughput: 1,296,000 iterations/sec
Throughput: 1,266,400 iterations/sec
Throughput: 1,291,800 iterations/sec
Throughput: 1,306,200 iterations/sec
Throughput: 1,267,200 iterations/sec
Throughput: 1,246,800 iterations/sec
Throughput: 1,228,200 iterations/sec
^C
mshapiro@matt-ms-hpz:~/code/opentelemetry-rust/stress$ cargo run --release --bin metrics
    Finished release [optimized] target(s) in 0.07s
     Running `/home/mshapiro/code/opentelemetry-rust/target/release/metrics`
Number of threads: 6
Throughput: 1,092,000 iterations/sec
Throughput: 1,102,000 iterations/sec
Throughput: 1,075,800 iterations/sec
Throughput: 1,102,000 iterations/sec
^C
mshapiro@matt-ms-hpz:~/code/opentelemetry-rust/stress$ Throughput: 971,400 iterations/sec
cargo run --release --bin metrics
   Compiling opentelemetry_sdk v0.21.1 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-sdk)
   Compiling opentelemetry-appender-tracing v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-appender-tracing)
   Compiling opentelemetry-stdout v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-stdout)
   Compiling stress v0.1.0 (/home/mshapiro/code/opentelemetry-rust/stress)
    Finished release [optimized] target(s) in 6.03s
     Running `/home/mshapiro/code/opentelemetry-rust/target/release/metrics`
Number of threads: 6
Throughput: 6,162,400 iterations/sec
Throughput: 5,684,200 iterations/sec
Throughput: 5,232,000 iterations/sec
Throughput: 5,629,600 iterations/sec
Throughput: 5,640,200 iterations/sec
^C

The first two runs and the final run are from the commit in this PR as of now. The middle two runs are from the main branch (with the stress test modified to use the same out-of-order attributes as my PR's changes).

Averaging these iterations/sec values gives the main branch 1,210,266 iterations per second, while my PR provided 6,168,066 iterations per second, roughly a 5x improvement.

To be sure where the performance difference was coming from, I also tried precomputing the hashes while keeping the BTreeSet; that gave a speedup but only got up to about 2,959,560 iterations per second, which is far below what I'm getting for the Vec implementation.

I also tested this change against our application, which had encountered metrics being a bottleneck for us. One function ends up particularly heavy for us: it increments two counters for every network packet we get. This change brought the total CPU usage for that function from 11.93% down to 8.24%. While I need to look at it a bit more, a significant amount of the performance gain ended up coming from improved call times to Measure::call(), presumably due to decreased hashing cost on lookups.

@hdost (Contributor) commented Nov 20, 2023

Thank you again very much for working on this!

@KallDrexx (Contributor Author) commented:

Ok I've double checked everything I wanted to, and things look good. I had to modify the stress test a tiny bit (lower the number of attribute values) since 4 attributes was causing us to go over the 2k cardinality limit.

I've updated the PR description with the final benchmarks from my machine. Would be nice if someone is able to verify similar numbers.

Interestingly enough, I did an experiment removing the deduplication logic and ended up with consistently lower throughput (only by ~100k-200k iterations per second). That doesn't quite make sense to me, especially since the stress test isn't actually hitting any duplicates.

So from my perspective this PR is ready for review.

Again, sorry for all the back and forth that went on while I did exploration :)

@cijothomas (Member) commented:

Would be nice if someone is able to verify similar numbers.

My laptop is giving a different result:
Throughput: 4,471,200 iterations/sec -- main branch with the stress test from this PR
Throughput: 5,492,000 iterations/sec -- this PR branch

@lalitb (Member) commented Nov 21, 2023

Would be nice if someone is able to verify similar numbers.

This is what I see (8 cores, WSL2, 32 GB):

Current PR: ~6,286,000 iterations/sec
Main branch (with stress test from current PR): ~2,592,000 iterations/sec

@cijothomas (Member) commented:

Would be nice if someone is able to verify similar numbers.

My laptop is giving a different result:
Throughput: 4,471,200 iterations/sec -- main branch with the stress test from this PR
Throughput: 5,492,000 iterations/sec -- this PR branch

another machine:

Throughput: 2,257,200 iterations/sec
Throughput: 2,236,200 iterations/sec
Throughput: 2,476,200 iterations/sec

vs

Throughput: 5,180,000 iterations/sec
Throughput: 5,242,400 iterations/sec
Throughput: 5,289,600 iterations/sec

@jtescher (Member) left a comment

One last issue and a nit; other than that, looks good!

opentelemetry-sdk/src/attributes/set.rs (review thread, outdated)
opentelemetry-sdk/src/attributes/set.rs (review thread)
@cijothomas (Member) commented:

Interestingly enough, I did an experiment removing the deduplication logic and ended up with consistently lower throughput (only by ~100k-200k iterations per second). That doesn't quite make sense to me, especially since the stress test isn't actually hitting any duplicates.

I see otherwise! Removing the de-dup caused another 10% boost for me... I cannot think of any reason why it showed lower throughput for you! Could you try re-running?

Before:
#[derive(Clone, Default, Debug, Hash, PartialEq, Eq)]
pub struct AttributeSet(BTreeSet<HashKeyValue>);

After:
#[derive(Clone, Default, Debug, PartialEq, Eq)]
pub struct AttributeSet(Vec<HashKeyValue>, u64);
(Member) commented:

Trying to see how caching the hash improves speed... I believe this is because it can now be re-used when we look up the timeseries here?

It'd help if we needed the hash calculation more than once, but we only need it once. So in this PR the hash calculation is simply moved from here to the AttributeSet creation, and I am not able to see how that would have helped with the perf improvement!

Or are the perf gains purely from the BTree -> Vec switch?

(Member) commented:

In other words, we should not make this change, as taking up more memory to store/cache the hash only pays off if the lookup/insert is needed more than once, and from my understanding we don't have a lookup-more-than-once requirement.

@KallDrexx (Contributor Author) commented:

Putting a breakpoint on Hash shows there are several hidden hash calls within the hashmap. For example:

[screenshot omitted]

So when the hash map grows, it has to re-hash all the time series in it. This may be why we see a 500% increase in stress test performance but a more muted (though still impactful) improvement in real-life applications. Precaching hashes improves this test case because we have 1k+ time series that sometimes need to be rehashed.

Although eventually that should stabilize; after a few thousand iterations I no longer see reserve calls requesting hashes, but I also don't see a significant change in throughput.

I did the following two runs:

mshapiro@matt-ms-hpz:~/code/opentelemetry-rust$ cargo run --release --package stress --bin metrics
   Compiling opentelemetry_sdk v0.21.1 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-sdk)
   Compiling opentelemetry-stdout v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-stdout)
   Compiling opentelemetry-appender-tracing v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-appender-tracing)
   Compiling stress v0.1.0 (/home/mshapiro/code/opentelemetry-rust/stress)
    Finished release [optimized] target(s) in 5.88s
     Running `target/release/metrics`
Number of threads: 6
Throughput: 5,525,800 iterations/sec
Throughput: 5,796,800 iterations/sec
Throughput: 5,553,600 iterations/sec
Throughput: 5,823,200 iterations/sec
Throughput: 5,942,600 iterations/sec
Throughput: 5,609,600 iterations/sec
^C
mshapiro@matt-ms-hpz:~/code/opentelemetry-rust$ cargo run --release --package stress --bin metrics
   Compiling opentelemetry_sdk v0.21.1 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-sdk)
   Compiling opentelemetry-appender-tracing v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-appender-tracing)
   Compiling opentelemetry-stdout v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-stdout)
   Compiling stress v0.1.0 (/home/mshapiro/code/opentelemetry-rust/stress)
    Finished release [optimized] target(s) in 5.86s
     Running `target/release/metrics`
Number of threads: 6
Throughput: 1,304,400 iterations/sec
Throughput: 1,314,800 iterations/sec
Throughput: 1,314,600 iterations/sec
Throughput: 1,158,400 iterations/sec
Throughput: 1,300,200 iterations/sec
Throughput: 1,332,600 iterations/sec
Throughput: 1,333,200 iterations/sec

The first run is with this PR as is, the 2nd run is commenting out the impl Hash block and adding Hash to the derive() attribute.

So pre-hashing is definitely providing an improvement on its own as far as I can tell. This is backed up by profiling our actual application, where we see Measure::call() go from 4.66% of our CPU time down to 2.30%. I'm also pretty sure we saw a reduction in hash_one percentages, but I didn't write them down in my notes.

So I'm not totally sure where the extra hashing is coming from, but it does seem to add some benefits in real use cases.

@lalitb (Member) commented Nov 21, 2023:

Good point. Looking into the code now, I don't see how caching is going to help here. Even for the slight improvement, the implication of storing the cache could be considerable. We have a slightly different design in otel-cpp which helps, as the lookup is not purely O(1) in the hot path.

Ok, ignore that; the above comment by @KallDrexx was added around the same time as this.
My further comment, based on the reply: it seems that the pre-computed hash cache is useful in the case of rehashing. If the HashMap is pre-allocated with a certain capacity (based on the supported upper cardinality limit) and does not need to be resized, will that avoid re-hashing, and therefore the need for the cache?

(Member) commented:

So when the hash map grows, it has to re-hash all the time series in it. This may be why we see a 500% increase in stress test performance but a more muted (though still impactful) improvement in real-life applications. Precaching hashes improves this test case because we have 1k+ time series that sometimes need to be rehashed.

We need to prevent this nonetheless. We can ensure the HashMap has enough capacity by initializing it correctly, so rehashing would be avoided. We know the max size (default 2000), so this should be relatively easy to achieve! Let us address this separately, as that is another nice improvement (avoiding reallocations)!
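
As an illustration of both points (a made-up counting key, not the SDK's ValueMap): a map that grows on demand re-hashes its existing keys at every resize, while one pre-sized to the ~2000-series cap hashes each key only once.

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

thread_local! {
    // Counts every call into Hash::hash so rehashing becomes visible.
    static HASH_CALLS: Cell<u64> = Cell::new(0);
}

#[derive(PartialEq, Eq)]
struct CountingKey(u64);

impl Hash for CountingKey {
    fn hash<H: Hasher>(&self, state: &mut H) {
        HASH_CALLS.with(|c| c.set(c.get() + 1));
        self.0.hash(state);
    }
}

fn fill(map: &mut HashMap<CountingKey, u64>) -> u64 {
    HASH_CALLS.with(|c| c.set(0));
    for i in 0..2_000u64 {
        map.insert(CountingKey(i), i);
    }
    HASH_CALLS.with(|c| c.get())
}

fn main() {
    let grown = fill(&mut HashMap::new());                    // resizes as it grows
    let presized = fill(&mut HashMap::with_capacity(2_000));  // sized for the cardinality cap
    // `grown` exceeds 2,000 because every resize re-hashes the existing keys;
    // `presized` stays right around 2,000 (one hash per insert).
    println!("hash calls: grown = {grown}, pre-sized = {presized}");
}
```

Since the stress test's map stops growing after the initial fill-up, pre-sizing would mostly help that first stretch, which lines up with the observations below.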

My stress test shows no change when I removed the custom Hash implementation!

@KallDrexx (Contributor Author) commented Nov 21, 2023:

As another data point, I changed ValueMap::new() to use HashMap::with_capacity(2000) to reduce resizing and rehashing. That made the first iteration report slightly better than vanilla main, but not by much. Every throughput report afterwards was within the same variance I get with main.

(Member) commented:

Rehashing impact cannot be tested using the current stress test, as there is no collect() triggered, so the hashmap is never emptied; after the initial fill-up it won't need to re-hash!

Here's what I suggest:

  1. We know from the tests that BTree -> Vec is definitely a perf booster, so let's scope this PR to do just that.
  2. For rehashing prevention, cache-the-hash, avoiding the de-duplication logic, avoiding sorting, and more - let's do those separately, one after the other, so each PR is very focused and demonstrates clean perf gains.
    (This PR has already kick-started a ton of ideas to improve perf!)

Thoughts?

@KallDrexx (Contributor Author) commented:

I'm not opposed to splitting it out. I can do that tomorrow.

@KallDrexx (Contributor Author) commented Nov 22, 2023:

Ok, it's 12:30am; I went to go to sleep and it finally dawned on me why precalculating the hash improves my throughput on some machines more than others. So I had to get out my laptop to verify, and yes, I know what's going on now.

You are correct that we are only computing the hash once, but where we compute the hash is important! Precomputing the hash the way I am in this PR (which I'm going to put in its own PR) incurs the cost at AttributeSet::from() time. Not precomputing the hash incurs the cost at Measure::call() time.

This is important because incurring the cost during the hash lookup means we are doing more work inside the mutex lock, and thus delaying all other Measure::call() operations. Precomputing the hash allows each counter to compute the hash in parallel, and thus do less work inside the lock.

This is why we were seeing such variance in whether it was helping or not. It all came down to thread counts for the throughput stress test. I verified this by testing with and without precomputed hashes with 6 threads on my PC, then with and without at 2 threads (fyi, 1 thread is broken).

On my personal laptop (and on battery):

  • 6 threads precomputed: 3.5M
  • 6 threads non-precomputed: 2.3M
  • 2 threads precomputed: 2.5M
  • 2 threads non-precomputed: 2.5M

So precomputing the hash is a concurrency optimization.
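
A minimal sketch of that effect with hypothetical types (not the SDK's actual ValueMap or AttributeSet): the expensive hashing happens while building the key, before the mutex is taken, and the Hash impl the map calls under the lock only replays the cached value.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

#[derive(PartialEq, Eq)]
struct PreHashed {
    values: Vec<(String, String)>,
    hash: u64,
}

impl PreHashed {
    fn new(mut values: Vec<(String, String)>) -> Self {
        values.sort(); // order-insensitive identity
        // DefaultHasher::new() is deterministic, so equal sets get equal hashes.
        let mut h = DefaultHasher::new();
        values.hash(&mut h);
        Self { hash: h.finish(), values }
    }
}

impl Hash for PreHashed {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.hash); // cheap: no string hashing under the lock
    }
}

fn record(series: &Mutex<HashMap<PreHashed, u64>>, attrs: Vec<(String, String)>, value: u64) {
    // Heavy work (sorting + hashing every key/value) runs here, outside the
    // lock, so concurrent callers do not serialize on it...
    let key = PreHashed::new(attrs);
    // ...and only the cheap lookup/update happens while the mutex is held.
    *series.lock().unwrap().entry(key).or_insert(0) += value;
}

fn main() {
    let series = Mutex::new(HashMap::new());
    record(&series, vec![("color".to_string(), "red".to_string())], 5);
    record(&series, vec![("color".to_string(), "red".to_string())], 3);
    assert_eq!(series.lock().unwrap().len(), 1); // both adds landed in one series
}
```

With few threads the total hashing work is the same either way, which lines up with the 2-thread numbers above being identical.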

(Member) commented:

Awesome! It makes sense. Let's tackle it in the follow-up PR.

If I learned anything over the last few years working in OTel, it is that there is no magic! Everything should be explainable. Thanks a lot!

@lalitb (Member) left a comment

LGTM. Suggestion: remove the cache for the hash and take it up as a separate enhancement. There needs to be a tradeoff analysis between the perf improvement and the storage allocation, and that probably needs more investigation.

@@ -33,14 +31,17 @@ fn test_counter() {
let index_first_attribute = rng.gen_range(0..len);
let index_second_attribute = rng.gen_range(0..len);
let index_third_attribute = rng.gen_range(0..len);
let index_fourth_attribute = rng.gen_range(0..len);
(Member) commented:

Let's not make changes to the stress test itself in this PR. We can drive changes to the stress test as a separate PR if we believe it should use unsorted order. (I do not believe it should, but I want to avoid adding noise to this PR.)

@cijothomas (Member) left a comment

Agree with the replacement of BTree with Vec! This is moving us in the right direction. Many thanks for doing that, and for all the other productive discussions in this PR. I hope the next set of PRs will take us even closer to the perf goals!

Requesting changes only to make sure the following is done before merge:

  1. Move the hash caching to a separate PR, after we demonstrate it is needed.
  2. Revert the changes to the stress test.

I opened #1389 to get a more focused benchmark. Could you run that and update the PR description with the results? I also suggest re-running the stress tests, as we don't see the 5x improvement on other machines!

@KallDrexx (Contributor Author) commented:

Updated the PR to remove the changes to the stress test, and remove caching of the hash. I will open up new PRs for these separately.

Will update the PR description with updated benchmarks and throughput values

@cijothomas (Member) left a comment

LGTM! Thank you!

@cijothomas cijothomas merged commit 075d26c into open-telemetry:main Nov 22, 2023
15 checks passed
@KallDrexx KallDrexx deleted the attribute_sets branch November 27, 2023 15:56
@KallDrexx KallDrexx restored the attribute_sets branch November 27, 2023 17:08
@KallDrexx KallDrexx deleted the attribute_sets branch November 27, 2023 17:19