Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add microbenchmark for LongKeyedBucketOrds #58608

Merged
merged 13 commits into from
Jul 13, 2020

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Jun 26, 2020

I've always been confused by the strange behavior that I saw when
working on #57304. Specifically, I saw switching from a bimorphic
invocation to a monomorphic invocation to give us a 7%-15% performance
bump. This felt bonkers to me. And, it also made me wonder whether
it'd be worth looking into doing it everywhere.

It turns out that, no, it isn't needed everywhere. This benchmark shows
that a bimorphic invocation like:

LongKeyedBucketOrds ords = new LongKeyedBucketOrds.ForSingle();
ords.add(0, 0); <------ this line

is 19% slower than a monomorphic invocation like:

LongKeyedBucketOrds.ForSingle ords = new LongKeyedBucketOrds.ForSingle();
ords.add(0, 0); <------ this line

But only when the reference is mutable. In the example above, if
ords is never changed then both perform the same. But if the ords
reference is assigned twice then we start to see the difference:

immutable bimorphic    avgt   10   6.468 ± 0.045  ns/op
immutable monomorphic  avgt   10   6.756 ± 0.026  ns/op
mutable   bimorphic    avgt   10   9.741 ± 0.073  ns/op
mutable   monomorphic  avgt   10   8.190 ± 0.016  ns/op

So the conclusion from all this is that we've done the right thing:
auto_date_histogram is the only aggregation in which ords isn't final
and it is the only aggregation that forces monomorphic invocations. All
other aggregations use an immutable bimorphic invocation. Which is fine.

Relates to #56487

I've always been confused by the strange behavior that I saw when
working on elastic#57304. Specifically, I saw switching from a bimorphic
invocation to a monomorphic invocation to give us a 7%-15% performance
bump. This felt *bonkers* to me. And, it also made me wonder whether
it'd be worth looking into doing it everywhere.

It turns out that, no, it isn't needed everywhere. This benchmark shows
that a bimorphic invocation like:
```
LongKeyedBucketOrds ords = new LongKeyedBucketOrds.ForSingle();
ords.add(0, 0); <------ this line
```

is 19% slower than a monomorphic invocation like:
```
LongKeyedBucketOrds.ForSingle ords = new LongKeyedBucketOrds.ForSingle();
ords.add(0, 0); <------ this line
```

But *only* when the reference is mutable. In the example above, if
`ords` is never changed then both perform the same. But if the `ords`
reference is assigned twice then we start to see the difference:
```
immutable bimorphic    avgt   10   6.468 ± 0.045  ns/op
immutable monomorphic  avgt   10   6.756 ± 0.026  ns/op
mutable   bimorphic    avgt   10   9.741 ± 0.073  ns/op
mutable   monomorphic  avgt   10   8.190 ± 0.016  ns/op
```

So the conclusion from all this is that we've done the right thing:
`auto_date_histogram` is the only aggregation in which `ords` isn't final
and it is the only aggregation that forces monomorphic invocations. All
other aggregations use an immutable bimorphic invocation. Which is fine.

Relates to elastic#56487
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jun 26, 2020
@polyfractal
Copy link
Contributor

Huh, interesting. I guess with the bimorphic case, the JVM is failing to recognize that it's still actually monomorphic despite being mutable, so probably falling back to a dynamic dispatch table or something? JVM black magic I suppose :)

Afraid my micro-benchmark-fu is very weak though, so not really sure I can review this haha :) Should we grab someone from the performance team to give it a quick skim?

@nik9000
Copy link
Member Author

nik9000 commented Jun 26, 2020

Huh, interesting. I guess with the bimorphic case, the JVM is failing to recognize that it's still actually monomorphic despite being mutable, so probably falling back to a dynamic dispatch table or something? JVM black magic I suppose :)

I did a little digging with perfasm and, even though I don't know what I'm doing, like, at all, I found that the mutable version of the call goes like:

                    0x00007f80d36fd11c: ;*lrem {reexecute=0 rethrow=0 return_oop=0}
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrdsBenchmark::singleBucketIntoSingleMutableBimorphicInvocation@48 (line 117)
                    0x00007f80d36fd11c: d34d 2bd0 | 4d85 c90f | 84e9 0100 
                    0x00007f80d36fd128: ;*ifne {reexecute=0 rethrow=0 return_oop=0}
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrdsBenchmark::singleBucketIntoSingleMutableBimorphicInvocation@26 (line 113)
                    0x00007f80d36fd128: 0048 8b44 | 2408 4885 | c00f 84c2 
                    0x00007f80d36fd134: ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrdsBenchmark::singleBucketIntoSingleMutableBimorphicInvocation@42 (line 117)
                    0x00007f80d36fd134: 0200 0044 
                    0x00007f80d36fd138: ;   {metadata(&apos;org/elasticsearch/search/aggregations/bucket/terms/LongKeyedBucketOrds$FromSingle&apos;)}
                    0x00007f80d36fd138: 8b40 0841 | 81f8 25b2 | 1700 0f85 
                    0x00007f80d36fd144: ;*invokevirtual add {reexecute=0 rethrow=0 return_oop=0}
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrdsBenchmark::singleBucketIntoSingleMutableBimorphicInvocation@49 (line 117)
                    0x00007f80d36fd144: 1403 0000 | 448b 400c 
  0.06%             0x00007f80d36fd14c: ; implicit exception: dispatches to 0x00007f80d36fd668
  0.06%             0x00007f80d36fd14c: 4f8b 4cc4 | 104b 8d0c 
                    0x00007f80d36fd154: ;*getfield ords {reexecute=0 rethrow=0 return_oop=0}
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrds$FromSingle::add@21 (line 126)
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrdsBenchmark::singleBucketIntoSingleMutableBimorphicInvocation@49 (line 117)

but the immutable version is:

                    0x00007f938f6f87bc: ;*lrem {reexecute=0 rethrow=0 return_oop=0}
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrdsBenchmark::singleBucketIntoSingleImmutableBimorphicInvocation@25 (line 83)
                    0x00007f938f6f87bc: 3f4c 8b0c 
                    0x00007f938f6f87c0: ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrdsBenchmark::singleBucketIntoSingleImmutableBimorphicInvocation@19 (line 83)
                    0x00007f938f6f87c0: 2448 b867 | 6666 6666 | 6666 6648 | f7eb 458b 
                    0x00007f938f6f87d0: ;*getfield ords {reexecute=0 rethrow=0 return_oop=0}
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrds$FromSingle::add@21 (line 126)
                                        ; - org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrdsBenchmark::singleBucketIntoSingleImmutableBimorphicInvocation@26 (line 83)

I'm almost certainly doing this wrong, but it looks like the immutable call site is inlined, but the mutable one is has the extra metadata and invokevirtual bits.

Now my benchmark is a little different because our loop is a tighter than the one from lucene and because I'm using a final variable in the method instead of on an object. I expect that the second one is a distinction without a difference. And the first one, well, Lucene seems to call our collectors pretty darn quick. But I still don't think this is all of it.

I do expect to do a little more work on these data structures in the future and to use the microbenchmark to sound out performance changes to them.

Copy link
Contributor

@talevy talevy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to Zach, I am unclear under which metrics to review a benchmark, but I do
have some questions and comments!

Should this be under the benchmarks package, not within server?

I've wanted to add benchmarks in the past, and was told the only ones we have committed are there to be used as examples. Your comment below:

I do expect to do a little more work on these data structures in the future and to use the microbenchmark to sound out performance changes to them.

convinces me that we should record this information somewhere, so why not merge it.

@nik9000
Copy link
Member Author

nik9000 commented Jul 6, 2020

Should we grab someone from the performance team to give it a quick skim?

I've added @danielmitterdorfer, yeah.

Should this be under the benchmarks package, not within server?

It's in the benchmarks project. I just stuck it in the same package as the code without thinking about it. I'll rename the package.

I've wanted to add benchmarks in the past, and was told the only ones we have committed are there to be used as examples.

I'd never heard that before! I figure if we can get useful data from the benchmarks now then committing them will make sure that we can at least compile them the next time we need them.

@talevy
Copy link
Contributor

talevy commented Jul 6, 2020

It's in the benchmarks project. I just stuck it in the same package as the code without thinking about it. I'll rename the package

woops, apologize. missed that. carry on!

Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the benchmarks are prone to dead code eliminitation. I suggest to return ords so it is consumed.

* because it is not needed.
*/
@Benchmark
public void singleBucketIntoSingleImmutableMonmorphicInvocation() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This benchmark is prone to an optimization called dead code elimination because you neither return anything nor you feed anything into JMH's Blackhole.

* Emulates the way that most aggregations use {@link LongKeyedBucketOrds}.
*/
@Benchmark
public void singleBucketIntoSingleImmutableBimorphicInvocation() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This benchmark is prone to an optimization called dead code elimination because you neither return anything nor you feed anything into JMH's Blackhole.

* Emulates the way that {@link AutoDateHistogramAggregationBuilder} uses {@link LongKeyedBucketOrds}.
*/
@Benchmark
public void singleBucketIntoSingleMutableMonmorphicInvocation() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This benchmark is prone to an optimization called dead code elimination because you neither return anything nor you feed anything into JMH's Blackhole.

* {@link #singleBucketIntoSingleMutableMonmorphicInvocation() monomorphic invocation}.
*/
@Benchmark
public void singleBucketIntoSingleMutableBimorphicInvocation() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This benchmark is prone to an optimization called dead code elimination because you neither return anything nor you feed anything into JMH's Blackhole.

* aggregation and there is only a single value for that term in the index.
*/
@Benchmark
public void singleBucketIntoMulti() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This benchmark is prone to an optimization called dead code elimination because you neither return anything nor you feed anything into JMH's Blackhole.

* Emulates an aggregation that collects from many buckets.
*/
@Benchmark
public void multiBucket() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This benchmark is prone to an optimization called dead code elimination because you neither return anything nor you feed anything into JMH's Blackhole.

@nik9000
Copy link
Member Author

nik9000 commented Jul 8, 2020

I think the benchmarks are prone to dead code eliminitation. I suggest to return ords so it is consumed.

Good call!

@nik9000
Copy link
Member Author

nik9000 commented Jul 8, 2020

I pushed a couple of updates. The results are much the same but the bimorphic mutable invocation has grown a little more efficient. I'm not sure why, but I'll take it. I still see a difference, but it is less pronounced:

Benchmark                                                                         Mode  Cnt   Score   Error  Units
LongKeyedBucketOrdsBenchmark.multiBucket                                          avgt   10  14.324 ± 0.160  ns/op
LongKeyedBucketOrdsBenchmark.singleBucketIntoMulti                                avgt   10  13.505 ± 0.105  ns/op
LongKeyedBucketOrdsBenchmark.singleBucketIntoSingleImmutableBimorphicInvocation   avgt   10   6.489 ± 0.001  ns/op
LongKeyedBucketOrdsBenchmark.singleBucketIntoSingleImmutableMonmorphicInvocation  avgt   10   6.463 ± 0.016  ns/op
LongKeyedBucketOrdsBenchmark.singleBucketIntoSingleMutableBimorphicInvocation     avgt   10   8.850 ± 0.057  ns/op
LongKeyedBucketOrdsBenchmark.singleBucketIntoSingleMutableMonmorphicInvocation    avgt   10   8.142 ± 0.049  ns/op

Its now less than 10%. The performance testing that I did on the agg that these are emulating showed at least a 7% difference, which is pretty substantial.

Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one suggestion but LGTM otherwise. No need for another review round.

*/
@Benchmark
public void singleBucketIntoSingleImmutableMonmorphicInvocation(Blackhole bh) {
forceLoadClasses(bh);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you're not interested in measuring class loading so you can move this code out of the measurement loop and into a separate setup method:

    @Setup
    public void setUp(Blackhole bh) {
        // you can also inline this method now that there is only one call site
        forceLoadClasses(bh);
    }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@nik9000 nik9000 merged commit e84a501 into elastic:master Jul 13, 2020
@nik9000
Copy link
Member Author

nik9000 commented Jul 13, 2020

Thanks for reviewing @danielmitterdorfer !

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jul 13, 2020
I've always been confused by the strange behavior that I saw when
working on elastic#57304. Specifically, I saw switching from a bimorphic
invocation to a monomorphic invocation to give us a 7%-15% performance
bump. This felt *bonkers* to me. And, it also made me wonder whether
it'd be worth looking into doing it everywhere.

It turns out that, no, it isn't needed everywhere. This benchmark shows
that a bimorphic invocation like:
```
LongKeyedBucketOrds ords = new LongKeyedBucketOrds.ForSingle();
ords.add(0, 0); <------ this line
```

is 19% slower than a monomorphic invocation like:
```
LongKeyedBucketOrds.ForSingle ords = new LongKeyedBucketOrds.ForSingle();
ords.add(0, 0); <------ this line
```

But *only* when the reference is mutable. In the example above, if
`ords` is never changed then both perform the same. But if the `ords`
reference is assigned twice then we start to see the difference:
```
immutable bimorphic    avgt   10   6.468 ± 0.045  ns/op
immutable monomorphic  avgt   10   6.756 ± 0.026  ns/op
mutable   bimorphic    avgt   10   9.741 ± 0.073  ns/op
mutable   monomorphic  avgt   10   8.190 ± 0.016  ns/op
```

So the conclusion from all this is that we've done the right thing:
`auto_date_histogram` is the only aggregation in which `ords` isn't final
and it is the only aggregation that forces monomorphic invocations. All
other aggregations use an immutable bimorphic invocation. Which is fine.

Relates to elastic#56487
nik9000 added a commit that referenced this pull request Jul 13, 2020
I've always been confused by the strange behavior that I saw when
working on #57304. Specifically, I saw switching from a bimorphic
invocation to a monomorphic invocation to give us a 7%-15% performance
bump. This felt *bonkers* to me. And, it also made me wonder whether
it'd be worth looking into doing it everywhere.

It turns out that, no, it isn't needed everywhere. This benchmark shows
that a bimorphic invocation like:
```
LongKeyedBucketOrds ords = new LongKeyedBucketOrds.ForSingle();
ords.add(0, 0); <------ this line
```

is 19% slower than a monomorphic invocation like:
```
LongKeyedBucketOrds.ForSingle ords = new LongKeyedBucketOrds.ForSingle();
ords.add(0, 0); <------ this line
```

But *only* when the reference is mutable. In the example above, if
`ords` is never changed then both perform the same. But if the `ords`
reference is assigned twice then we start to see the difference:
```
immutable bimorphic    avgt   10   6.468 ± 0.045  ns/op
immutable monomorphic  avgt   10   6.756 ± 0.026  ns/op
mutable   bimorphic    avgt   10   9.741 ± 0.073  ns/op
mutable   monomorphic  avgt   10   8.190 ± 0.016  ns/op
```

So the conclusion from all this is that we've done the right thing:
`auto_date_histogram` is the only aggregation in which `ords` isn't final
and it is the only aggregation that forces monomorphic invocations. All
other aggregations use an immutable bimorphic invocation. Which is fine.

Relates to #56487
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v7.9.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants