Add microbenchmark for LongKeyedBucketOrds #58608

nik9000 · 2020-06-26T14:11:31Z

I've always been confused by the strange behavior that I saw when
working on #57304. Specifically, I saw switching from a bimorphic
invocation to a monomorphic invocation to give us a 7%-15% performance
bump. This felt bonkers to me. And, it also made me wonder whether
it'd be worth looking into doing it everywhere.

It turns out that, no, it isn't needed everywhere. This benchmark shows
that a bimorphic invocation like:

LongKeyedBucketOrds ords = new LongKeyedBucketOrds.ForSingle();
ords.add(0, 0); <------ this line

is 19% slower than a monomorphic invocation like:

LongKeyedBucketOrds.ForSingle ords = new LongKeyedBucketOrds.ForSingle();
ords.add(0, 0); <------ this line

But only when the reference is mutable. In the example above, if
ords is never changed then both perform the same. But if the ords
reference is assigned twice then we start to see the difference:

immutable bimorphic    avgt   10   6.468 ± 0.045  ns/op
immutable monomorphic  avgt   10   6.756 ± 0.026  ns/op
mutable   bimorphic    avgt   10   9.741 ± 0.073  ns/op
mutable   monomorphic  avgt   10   8.190 ± 0.016  ns/op

So the conclusion from all this is that we've done the right thing:
auto_date_histogram is the only aggregation in which ords isn't final
and it is the only aggregation that forces monomorphic invocations. All
other aggregations use an immutable bimorphic invocation. Which is fine.

Relates to #56487

I've always been confused by the strange behavior that I saw when working on elastic#57304. Specifically, I saw switching from a bimorphic invocation to a monomorphic invocation to give us a 7%-15% performance bump. This felt *bonkers* to me. And, it also made me wonder whether it'd be worth looking into doing it everywhere. It turns out that, no, it isn't needed everywhere. This benchmark shows that a bimorphic invocation like: ``` LongKeyedBucketOrds ords = new LongKeyedBucketOrds.ForSingle(); ords.add(0, 0); <------ this line ``` is 19% slower than a monomorphic invocation like: ``` LongKeyedBucketOrds.ForSingle ords = new LongKeyedBucketOrds.ForSingle(); ords.add(0, 0); <------ this line ``` But *only* when the reference is mutable. In the example above, if `ords` is never changed then both perform the same. But if the `ords` reference is assigned twice then we start to see the difference: ``` immutable bimorphic avgt 10 6.468 ± 0.045 ns/op immutable monomorphic avgt 10 6.756 ± 0.026 ns/op mutable bimorphic avgt 10 9.741 ± 0.073 ns/op mutable monomorphic avgt 10 8.190 ± 0.016 ns/op ``` So the conclusion from all this is that we've done the right thing: `auto_date_histogram` is the only aggregation in which `ords` isn't final and it is the only aggregation that forces monomorphic invocations. All other aggregations use an immutable bimorphic invocation. Which is fine. Relates to elastic#56487

elasticmachine · 2020-06-26T14:11:33Z

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

polyfractal · 2020-06-26T14:40:16Z

Huh, interesting. I guess with the bimorphic case, the JVM is failing to recognize that it's still actually monomorphic despite being mutable, so probably falling back to a dynamic dispatch table or something? JVM black magic I suppose :)

Afraid my micro-benchmark-fu is very weak though, so not really sure I can review this haha :) Should we grab someone from the performance team to give it a quick skim?

nik9000 · 2020-06-26T16:57:49Z

Huh, interesting. I guess with the bimorphic case, the JVM is failing to recognize that it's still actually monomorphic despite being mutable, so probably falling back to a dynamic dispatch table or something? JVM black magic I suppose :)

I did a little digging with perfasm and, even though I don't know what I'm doing, like, at all, I found that the mutable version of the call goes like:

                    0x00007f80d36fd11c: ;*lrem {reexecute=0 rethrow=0 return_oop=0}
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrdsBenchmark::singleBucketIntoSingleMutableBimorphicInvocation@48 (line 117)
                    0x00007f80d36fd11c: d34d 2bd0 | 4d85 c90f | 84e9 0100 
                    0x00007f80d36fd128: ;*ifne {reexecute=0 rethrow=0 return_oop=0}
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrdsBenchmark::singleBucketIntoSingleMutableBimorphicInvocation@26 (line 113)
                    0x00007f80d36fd128: 0048 8b44 | 2408 4885 | c00f 84c2 
                    0x00007f80d36fd134: ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrdsBenchmark::singleBucketIntoSingleMutableBimorphicInvocation@42 (line 117)
                    0x00007f80d36fd134: 0200 0044 
                    0x00007f80d36fd138: ;   {metadata(&apos;org/elasticsearch/search/aggregations/bucket/terms/LongKeyedBucketOrds$FromSingle&apos;)}
                    0x00007f80d36fd138: 8b40 0841 | 81f8 25b2 | 1700 0f85 
                    0x00007f80d36fd144: ;*invokevirtual add {reexecute=0 rethrow=0 return_oop=0}
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrdsBenchmark::singleBucketIntoSingleMutableBimorphicInvocation@49 (line 117)
                    0x00007f80d36fd144: 1403 0000 | 448b 400c 
  0.06%             0x00007f80d36fd14c: ; implicit exception: dispatches to 0x00007f80d36fd668
  0.06%             0x00007f80d36fd14c: 4f8b 4cc4 | 104b 8d0c 
                    0x00007f80d36fd154: ;*getfield ords {reexecute=0 rethrow=0 return_oop=0}
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrds$FromSingle::add@21 (line 126)
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrdsBenchmark::singleBucketIntoSingleMutableBimorphicInvocation@49 (line 117)

but the immutable version is:

                    0x00007f938f6f87bc: ;*lrem {reexecute=0 rethrow=0 return_oop=0}
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrdsBenchmark::singleBucketIntoSingleImmutableBimorphicInvocation@25 (line 83)
                    0x00007f938f6f87bc: 3f4c 8b0c 
                    0x00007f938f6f87c0: ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrdsBenchmark::singleBucketIntoSingleImmutableBimorphicInvocation@19 (line 83)
                    0x00007f938f6f87c0: 2448 b867 | 6666 6666 | 6666 6648 | f7eb 458b 
                    0x00007f938f6f87d0: ;*getfield ords {reexecute=0 rethrow=0 return_oop=0}
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrds$FromSingle::add@21 (line 126)
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrdsBenchmark::singleBucketIntoSingleImmutableBimorphicInvocation@26 (line 83)

I'm almost certainly doing this wrong, but it looks like the immutable call site is inlined, but the mutable one is has the extra metadata and invokevirtual bits.

Now my benchmark is a little different because our loop is a tighter than the one from lucene and because I'm using a final variable in the method instead of on an object. I expect that the second one is a distinction without a difference. And the first one, well, Lucene seems to call our collectors pretty darn quick. But I still don't think this is all of it.

I do expect to do a little more work on these data structures in the future and to use the microbenchmark to sound out performance changes to them.

talevy

Similar to Zach, I am unclear under which metrics to review a benchmark, but I do
have some questions and comments!

Should this be under the benchmarks package, not within server?

I've wanted to add benchmarks in the past, and was told the only ones we have committed are there to be used as examples. Your comment below:

I do expect to do a little more work on these data structures in the future and to use the microbenchmark to sound out performance changes to them.

convinces me that we should record this information somewhere, so why not merge it.

nik9000 · 2020-07-06T14:09:24Z

Should we grab someone from the performance team to give it a quick skim?

I've added @danielmitterdorfer, yeah.

Should this be under the benchmarks package, not within server?

It's in the benchmarks project. I just stuck it in the same package as the code without thinking about it. I'll rename the package.

I've wanted to add benchmarks in the past, and was told the only ones we have committed are there to be used as examples.

I'd never heard that before! I figure if we can get useful data from the benchmarks now then committing them will make sure that we can at least compile them the next time we need them.

talevy · 2020-07-06T15:57:37Z

It's in the benchmarks project. I just stuck it in the same package as the code without thinking about it. I'll rename the package

woops, apologize. missed that. carry on!

danielmitterdorfer

I think the benchmarks are prone to dead code eliminitation. I suggest to return ords so it is consumed.

danielmitterdorfer · 2020-07-08T06:40:35Z

...g/elasticsearch/benchmark/search/aggregations/bucket/terms/LongKeyedBucketOrdsBenchmark.java

+     * because it is not needed.
+     */
+    @Benchmark
+    public void singleBucketIntoSingleImmutableMonmorphicInvocation() {


This benchmark is prone to an optimization called dead code elimination because you neither return anything nor you feed anything into JMH's Blackhole.

danielmitterdorfer · 2020-07-08T06:40:40Z

...g/elasticsearch/benchmark/search/aggregations/bucket/terms/LongKeyedBucketOrdsBenchmark.java

+     * Emulates the way that most aggregations use {@link LongKeyedBucketOrds}.
+     */
+    @Benchmark
+    public void singleBucketIntoSingleImmutableBimorphicInvocation() {


This benchmark is prone to an optimization called dead code elimination because you neither return anything nor you feed anything into JMH's Blackhole.

danielmitterdorfer · 2020-07-08T06:40:45Z

...g/elasticsearch/benchmark/search/aggregations/bucket/terms/LongKeyedBucketOrdsBenchmark.java

+     * Emulates the way that {@link AutoDateHistogramAggregationBuilder} uses {@link LongKeyedBucketOrds}.
+     */
+    @Benchmark
+    public void singleBucketIntoSingleMutableMonmorphicInvocation() {


This benchmark is prone to an optimization called dead code elimination because you neither return anything nor you feed anything into JMH's Blackhole.

danielmitterdorfer · 2020-07-08T06:40:53Z

...g/elasticsearch/benchmark/search/aggregations/bucket/terms/LongKeyedBucketOrdsBenchmark.java

+     * {@link #singleBucketIntoSingleMutableMonmorphicInvocation() monomorphic invocation}.
+     */
+    @Benchmark
+    public void singleBucketIntoSingleMutableBimorphicInvocation() {


This benchmark is prone to an optimization called dead code elimination because you neither return anything nor you feed anything into JMH's Blackhole.

danielmitterdorfer · 2020-07-08T06:40:58Z

...g/elasticsearch/benchmark/search/aggregations/bucket/terms/LongKeyedBucketOrdsBenchmark.java

+     * aggregation and there is only a single value for that term in the index.
+     */
+    @Benchmark
+    public void singleBucketIntoMulti() {


This benchmark is prone to an optimization called dead code elimination because you neither return anything nor you feed anything into JMH's Blackhole.

danielmitterdorfer · 2020-07-08T06:41:02Z

...g/elasticsearch/benchmark/search/aggregations/bucket/terms/LongKeyedBucketOrdsBenchmark.java

+     * Emulates an aggregation that collects from many buckets.
+     */
+    @Benchmark
+    public void multiBucket() {


This benchmark is prone to an optimization called dead code elimination because you neither return anything nor you feed anything into JMH's Blackhole.

nik9000 · 2020-07-08T19:46:37Z

I think the benchmarks are prone to dead code eliminitation. I suggest to return ords so it is consumed.

Good call!

nik9000 · 2020-07-08T20:07:33Z

I pushed a couple of updates. The results are much the same but the bimorphic mutable invocation has grown a little more efficient. I'm not sure why, but I'll take it. I still see a difference, but it is less pronounced:

Benchmark                                                                         Mode  Cnt   Score   Error  Units
LongKeyedBucketOrdsBenchmark.multiBucket                                          avgt   10  14.324 ± 0.160  ns/op
LongKeyedBucketOrdsBenchmark.singleBucketIntoMulti                                avgt   10  13.505 ± 0.105  ns/op
LongKeyedBucketOrdsBenchmark.singleBucketIntoSingleImmutableBimorphicInvocation   avgt   10   6.489 ± 0.001  ns/op
LongKeyedBucketOrdsBenchmark.singleBucketIntoSingleImmutableMonmorphicInvocation  avgt   10   6.463 ± 0.016  ns/op
LongKeyedBucketOrdsBenchmark.singleBucketIntoSingleMutableBimorphicInvocation     avgt   10   8.850 ± 0.057  ns/op
LongKeyedBucketOrdsBenchmark.singleBucketIntoSingleMutableMonmorphicInvocation    avgt   10   8.142 ± 0.049  ns/op

Its now less than 10%. The performance testing that I did on the agg that these are emulating showed at least a 7% difference, which is pretty substantial.

danielmitterdorfer

I left one suggestion but LGTM otherwise. No need for another review round.

danielmitterdorfer · 2020-07-13T08:53:48Z

...g/elasticsearch/benchmark/search/aggregations/bucket/terms/LongKeyedBucketOrdsBenchmark.java

+     */
+    @Benchmark
+    public void singleBucketIntoSingleImmutableMonmorphicInvocation(Blackhole bh) {
+        forceLoadClasses(bh);


I assume you're not interested in measuring class loading so you can move this code out of the measurement loop and into a separate setup method:

@Setup public void setUp(Blackhole bh) { // you can also inline this method now that there is only one call site forceLoadClasses(bh); }

nik9000 · 2020-07-13T20:01:31Z

Thanks for reviewing @danielmitterdorfer !

I've always been confused by the strange behavior that I saw when working on elastic#57304. Specifically, I saw switching from a bimorphic invocation to a monomorphic invocation to give us a 7%-15% performance bump. This felt *bonkers* to me. And, it also made me wonder whether it'd be worth looking into doing it everywhere. It turns out that, no, it isn't needed everywhere. This benchmark shows that a bimorphic invocation like: ``` LongKeyedBucketOrds ords = new LongKeyedBucketOrds.ForSingle(); ords.add(0, 0); <------ this line ``` is 19% slower than a monomorphic invocation like: ``` LongKeyedBucketOrds.ForSingle ords = new LongKeyedBucketOrds.ForSingle(); ords.add(0, 0); <------ this line ``` But *only* when the reference is mutable. In the example above, if `ords` is never changed then both perform the same. But if the `ords` reference is assigned twice then we start to see the difference: ``` immutable bimorphic avgt 10 6.468 ± 0.045 ns/op immutable monomorphic avgt 10 6.756 ± 0.026 ns/op mutable bimorphic avgt 10 9.741 ± 0.073 ns/op mutable monomorphic avgt 10 8.190 ± 0.016 ns/op ``` So the conclusion from all this is that we've done the right thing: `auto_date_histogram` is the only aggregation in which `ords` isn't final and it is the only aggregation that forces monomorphic invocations. All other aggregations use an immutable bimorphic invocation. Which is fine. Relates to elastic#56487

I've always been confused by the strange behavior that I saw when working on #57304. Specifically, I saw switching from a bimorphic invocation to a monomorphic invocation to give us a 7%-15% performance bump. This felt *bonkers* to me. And, it also made me wonder whether it'd be worth looking into doing it everywhere. It turns out that, no, it isn't needed everywhere. This benchmark shows that a bimorphic invocation like: ``` LongKeyedBucketOrds ords = new LongKeyedBucketOrds.ForSingle(); ords.add(0, 0); <------ this line ``` is 19% slower than a monomorphic invocation like: ``` LongKeyedBucketOrds.ForSingle ords = new LongKeyedBucketOrds.ForSingle(); ords.add(0, 0); <------ this line ``` But *only* when the reference is mutable. In the example above, if `ords` is never changed then both perform the same. But if the `ords` reference is assigned twice then we start to see the difference: ``` immutable bimorphic avgt 10 6.468 ± 0.045 ns/op immutable monomorphic avgt 10 6.756 ± 0.026 ns/op mutable bimorphic avgt 10 9.741 ± 0.073 ns/op mutable monomorphic avgt 10 8.190 ± 0.016 ns/op ``` So the conclusion from all this is that we've done the right thing: `auto_date_histogram` is the only aggregation in which `ords` isn't final and it is the only aggregation that forces monomorphic invocations. All other aggregations use an immutable bimorphic invocation. Which is fine. Relates to #56487

nik9000 added >non-issue :Analytics/Aggregations Aggregations v8.0.0 v7.9.0 labels Jun 26, 2020

nik9000 requested review from talevy, not-napoleon and polyfractal June 26, 2020 14:11

elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jun 26, 2020

Merge branch 'master' into long_keyed_bucket_ords

7c20cd4

nik9000 requested a review from danielmitterdorfer June 30, 2020 21:10

Merge branch 'master' into long_keyed_bucket_ords

9075f23

talevy reviewed Jul 1, 2020

View reviewed changes

nik9000 added 2 commits July 6, 2020 10:01

Merge branch 'master' into long_keyed_bucket_ords

2e4fb47

Move benchmark to proper package

80ba140

danielmitterdorfer reviewed Jul 8, 2020

View reviewed changes

nik9000 added 4 commits July 8, 2020 15:33

Merge branch 'master' into long_keyed_bucket_ords

6d26ed4

Blackhole

0f46086

Handle merge

dde47b2

javadoc

7ab3390

nik9000 added 2 commits July 8, 2020 15:47

Force init

d31dd30

Force loading classes

67b57db

nik9000 requested a review from danielmitterdorfer July 9, 2020 14:59

danielmitterdorfer approved these changes Jul 13, 2020

View reviewed changes

nik9000 added 2 commits July 13, 2020 14:33

Merge branch 'master' into long_keyed_bucket_ords

308c982

Move force loading classes to setup

f426b55

nik9000 merged commit e84a501 into elastic:master Jul 13, 2020

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add microbenchmark for LongKeyedBucketOrds #58608

Add microbenchmark for LongKeyedBucketOrds #58608

nik9000 commented Jun 26, 2020

elasticmachine commented Jun 26, 2020

polyfractal commented Jun 26, 2020

nik9000 commented Jun 26, 2020

talevy left a comment

nik9000 commented Jul 6, 2020

talevy commented Jul 6, 2020

danielmitterdorfer left a comment

danielmitterdorfer Jul 8, 2020

danielmitterdorfer Jul 8, 2020

danielmitterdorfer Jul 8, 2020

danielmitterdorfer Jul 8, 2020

danielmitterdorfer Jul 8, 2020

danielmitterdorfer Jul 8, 2020

nik9000 commented Jul 8, 2020

nik9000 commented Jul 8, 2020

danielmitterdorfer left a comment

danielmitterdorfer Jul 13, 2020

nik9000 Jul 13, 2020

nik9000 commented Jul 13, 2020

Add microbenchmark for LongKeyedBucketOrds #58608

Add microbenchmark for LongKeyedBucketOrds #58608

Conversation

nik9000 commented Jun 26, 2020

elasticmachine commented Jun 26, 2020

polyfractal commented Jun 26, 2020

nik9000 commented Jun 26, 2020

talevy left a comment

Choose a reason for hiding this comment

nik9000 commented Jul 6, 2020

talevy commented Jul 6, 2020

danielmitterdorfer left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nik9000 commented Jul 8, 2020

nik9000 commented Jul 8, 2020

danielmitterdorfer left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nik9000 commented Jul 13, 2020