Add microbenchmark for LongKeyedBucketOrds #58608
I've always been confused by the strange behavior that I saw when working on elastic#57304. Specifically, I saw switching from a bimorphic invocation to a monomorphic invocation to give us a 7%-15% performance bump. This felt *bonkers* to me. And, it also made me wonder whether it'd be worth looking into doing it everywhere.

It turns out that, no, it isn't needed everywhere. This benchmark shows that a bimorphic invocation like:

```
LongKeyedBucketOrds ords = new LongKeyedBucketOrds.ForSingle();
ords.add(0, 0); <------ this line
```

is 19% slower than a monomorphic invocation like:

```
LongKeyedBucketOrds.ForSingle ords = new LongKeyedBucketOrds.ForSingle();
ords.add(0, 0); <------ this line
```

But *only* when the reference is mutable. In the example above, if `ords` is never changed then both perform the same. But if the `ords` reference is assigned twice then we start to see the difference:

```
immutable bimorphic    avgt 10 6.468 ± 0.045 ns/op
immutable monomorphic  avgt 10 6.756 ± 0.026 ns/op
mutable   bimorphic    avgt 10 9.741 ± 0.073 ns/op
mutable   monomorphic  avgt 10 8.190 ± 0.016 ns/op
```

So the conclusion from all this is that we've done the right thing: `auto_date_histogram` is the only aggregation in which `ords` isn't final and it is the only aggregation that forces monomorphic invocations. All other aggregations use an immutable bimorphic invocation. Which is fine.

Relates to elastic#56487
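For anyone who wants to see the four shapes side by side, here is a minimal sketch in plain Java. `Ords`, `ForSingle`, and `ForMulti` are hypothetical stand-ins and not the real `LongKeyedBucketOrds` classes. The sketch only shows what "mutable" versus "immutable" and "bimorphic" versus "monomorphic" mean for the reference; it does not measure anything.

```java
class CallSiteShapes {
    /** Hypothetical stand-in for LongKeyedBucketOrds; not the Elasticsearch class. */
    abstract static class Ords {
        abstract long add(long owningBucketOrd, long value);
    }

    /** Hypothetical implementation that hands out ordinals 0, 1, 2, ... */
    static final class ForSingle extends Ords {
        private long next;

        @Override
        long add(long owningBucketOrd, long value) {
            return next++;
        }
    }

    /** Second implementation, present only so the abstract type has two concrete classes. */
    static final class ForMulti extends Ords {
        private long next;

        @Override
        long add(long owningBucketOrd, long value) {
            return next++;
        }
    }

    /** Declared as the abstract type, never reassigned. */
    static long immutableBimorphic(int calls) {
        final Ords ords = new ForSingle();
        long last = -1;
        for (int i = 0; i < calls; i++) {
            last = ords.add(0, i);
        }
        return last;
    }

    /** Declared as the concrete type, never reassigned. */
    static long immutableMonomorphic(int calls) {
        final ForSingle ords = new ForSingle();
        long last = -1;
        for (int i = 0; i < calls; i++) {
            last = ords.add(0, i);
        }
        return last;
    }

    /** Declared as the abstract type and assigned twice. */
    static long mutableBimorphic(int calls) {
        Ords ords = new ForSingle();
        long last = -1;
        for (int i = 0; i < calls; i++) {
            if (i == calls / 2) {
                ords = new ForSingle(); // second assignment makes the reference mutable
            }
            last = ords.add(0, i);
        }
        return last;
    }

    /** Declared as the concrete type and assigned twice. */
    static long mutableMonomorphic(int calls) {
        ForSingle ords = new ForSingle();
        long last = -1;
        for (int i = 0; i < calls; i++) {
            if (i == calls / 2) {
                ords = new ForSingle(); // second assignment makes the reference mutable
            }
            last = ords.add(0, i);
        }
        return last;
    }

    public static void main(String[] args) {
        System.out.println(immutableBimorphic(10));   // 9
        System.out.println(immutableMonomorphic(10)); // 9
        System.out.println(mutableBimorphic(10));     // 4
        System.out.println(mutableMonomorphic(10));   // 4
    }
}
```

Each method returns the last ordinal so that the result of the loop is observable by the caller.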
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)
Huh, interesting. I guess with the bimorphic case, the JVM is failing to recognize that it's still actually monomorphic despite being mutable, so probably falling back to a dynamic dispatch table or something? JVM black magic I suppose :)

Afraid my micro-benchmark-fu is very weak though, so not really sure I can review this haha :) Should we grab someone from the performance team to give it a quick skim?
I did a little digging with
but the immutable version is:
I'm almost certainly doing this wrong, but it looks like the immutable call site is inlined, but the mutable one has the extra

Now my benchmark is a little different because our loop is tighter than the one from Lucene and because I'm using a final variable in the method instead of on an object. I expect that the second one is a distinction without a difference. And the first one, well, Lucene seems to call our collectors pretty darn quick. But I still don't think this is all of it.

I do expect to do a little more work on these data structures in the future and to use the microbenchmark to sound out performance changes to them.
Similar to Zach, I am unclear under which metrics to review a benchmark, but I do
have some questions and comments!
Should this be under the benchmarks package, not within server?
I've wanted to add benchmarks in the past, and was told the only ones we have committed are there to be used as examples. Your comment below:
I do expect to do a little more work on these data structures in the future and to use the microbenchmark to sound out performance changes to them.
convinces me that we should record this information somewhere, so why not merge it.
I've added @danielmitterdorfer, yeah.
It's in the benchmarks project. I just stuck it in the same package as the code without thinking about it. I'll rename the package.
I'd never heard that before! I figure if we can get useful data from the benchmarks now then committing them will make sure that we can at least compile them the next time we need them.
Whoops, apologies, missed that. Carry on!
I think the benchmarks are prone to dead code elimination. I suggest returning ords
so it is consumed.
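To make the suggestion concrete, here is a minimal sketch in plain Java, without the JMH dependency. `work` is a hypothetical workload standing in for a benchmark body. In JMH itself, a value returned from a `@Benchmark` method is consumed by the harness, which is what keeps the computation from being treated as dead code.

```java
class ConsumeResult {
    /** Hypothetical workload standing in for the body of a benchmark method. */
    static long work(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += i % 7;
        }
        return sum;
    }

    /** The result is discarded, so a JIT compiler is free to drop the computation. */
    static void discardsResult() {
        work(1_000);
    }

    /** The result is returned, so the caller (in JMH, the harness) observes it. */
    static long returnsResult() {
        return work(1_000);
    }

    public static void main(String[] args) {
        discardsResult();
        System.out.println(returnsResult());
    }
}
```

Feeding the value into JMH's `Blackhole` has the same effect and is the usual choice when a method produces more than one value.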
 * because it is not needed.
 */
@Benchmark
public void singleBucketIntoSingleImmutableMonmorphicInvocation() {
This benchmark is prone to an optimization called dead code elimination because you neither return anything nor feed anything into JMH's Blackhole.
 * Emulates the way that most aggregations use {@link LongKeyedBucketOrds}.
 */
@Benchmark
public void singleBucketIntoSingleImmutableBimorphicInvocation() {
This benchmark is prone to an optimization called dead code elimination because you neither return anything nor feed anything into JMH's Blackhole.
 * Emulates the way that {@link AutoDateHistogramAggregationBuilder} uses {@link LongKeyedBucketOrds}.
 */
@Benchmark
public void singleBucketIntoSingleMutableMonmorphicInvocation() {
This benchmark is prone to an optimization called dead code elimination because you neither return anything nor feed anything into JMH's Blackhole.
 * {@link #singleBucketIntoSingleMutableMonmorphicInvocation() monomorphic invocation}.
 */
@Benchmark
public void singleBucketIntoSingleMutableBimorphicInvocation() {
This benchmark is prone to an optimization called dead code elimination because you neither return anything nor feed anything into JMH's Blackhole.
 * aggregation and there is only a single value for that term in the index.
 */
@Benchmark
public void singleBucketIntoMulti() {
This benchmark is prone to an optimization called dead code elimination because you neither return anything nor feed anything into JMH's Blackhole.
 * Emulates an aggregation that collects from many buckets.
 */
@Benchmark
public void multiBucket() {
This benchmark is prone to an optimization called dead code elimination because you neither return anything nor feed anything into JMH's Blackhole.
Good call!
I pushed a couple of updates. The results are much the same but the bimorphic mutable invocation has grown a little more efficient. I'm not sure why, but I'll take it. I still see a difference, but it is less pronounced:
It's now less than 10%. The performance testing that I did on the agg that these are emulating showed at least a 7% difference, which is pretty substantial.
I left one suggestion but LGTM otherwise. No need for another review round.
 */
@Benchmark
public void singleBucketIntoSingleImmutableMonmorphicInvocation(Blackhole bh) {
    forceLoadClasses(bh);
I assume you're not interested in measuring class loading so you can move this code out of the measurement loop and into a separate setup method:
@Setup
public void setUp(Blackhole bh) {
// you can also inline this method now that there is only one call site
forceLoadClasses(bh);
}
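The effect of that move can be sketched in plain Java without JMH. `SetupOutsideLoop` and its counters are hypothetical, and `run` only imitates what a harness does: it calls the setup method once and then calls the measured method repeatedly. In JMH, `@Setup` defaults to `Level.Trial`, so the setup method is not part of what gets timed.

```java
class SetupOutsideLoop {
    int setUpCalls;
    int measuredCalls;

    /** Stand-in for the one-time work, e.g. forcing classes to load. */
    void setUp() {
        setUpCalls++;
    }

    /** Stand-in for the benchmark body; it no longer repeats the one-time work. */
    void measured() {
        measuredCalls++;
    }

    /** Imitates a harness: one setup call, then the measured invocations. */
    static SetupOutsideLoop run(int invocations) {
        SetupOutsideLoop benchmark = new SetupOutsideLoop();
        benchmark.setUp();
        for (int i = 0; i < invocations; i++) {
            benchmark.measured();
        }
        return benchmark;
    }

    public static void main(String[] args) {
        SetupOutsideLoop benchmark = run(5);
        System.out.println(benchmark.setUpCalls + " setup, " + benchmark.measuredCalls + " measured");
    }
}
```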
👍
Thanks for reviewing, @danielmitterdorfer!