
Save memory when rare_terms is not on top #57948

Merged · 5 commits · Jun 12, 2020

Conversation

@nik9000 (Member) commented Jun 10, 2020

This uses the optimization that we started making in #55873 for
rare_terms to save a bit of memory when that aggregation is not on the
top level.

@elasticmachine (Collaborator) commented:

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine added the Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)) label on Jun 10, 2020
@polyfractal (Contributor) left a comment

Left a few comments, mostly around naming :) Only left them on the Long agg, but they apply equally to String.

Otherwise LGTM :)

@@ -39,13 +39,13 @@
* An approximate set membership datastructure that scales as more unique values are inserted.
* Can definitively say if a member does not exist (no false negatives), but may say an item exists
* when it does not (has false positives). Similar in usage to a Bloom Filter.
*
* <p>
@polyfractal (Contributor) commented:

I hate javadocs so much :( optimizing rendered readability while sacrificing IDE readability :(

Thanks for fixing this :)
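As a rough illustration of the contract that javadoc describes (definitely-absent vs. probably-present), here is a toy Bloom-style filter. This is a sketch of the general idea only, with hash mixing and names of my own invention; it is not the scaling CuckooFilter-like class this PR actually touches.

```java
import java.util.BitSet;

// Toy approximate set membership: no false negatives, possible false positives.
// Illustrative only; the production class in this PR is a different structure.
public class ToyApproxSet {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public ToyApproxSet(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Cheap seeded mixing; constants are arbitrary odd 64-bit multipliers.
    private int index(long value, int seed) {
        long h = value * 0x9E3779B97F4A7C15L + seed * 0xC2B2AE3D27D4EB4FL;
        h ^= h >>> 31;
        return (int) Math.floorMod(h, (long) numBits);
    }

    public void add(long value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(value, i));
        }
    }

    /** False means "definitely absent"; true means "probably present". */
    public boolean mightContain(long value) {
        for (int i = 0; i < numHashes; i++) {
            if (bits.get(index(value, i)) == false) {
                return false;
            }
        }
        return true;
    }
}
```

Anything that was added must report `mightContain == true`; only the reverse direction can be wrong, which is exactly the Bloom-filter-like usage the javadoc calls out.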

long keepCount = 0;
long[] mergeMap = new long[(int) bucketOrds.size()];
Arrays.fill(mergeMap, -1);
long size = 0;
@polyfractal (Contributor) commented:

Hmm, this is a bit confusingly named, I think? Maybe `currentOffset` or something? Not sure, but `size` feels a bit confusing.

LongRareTerms.Bucket bucket = new LongRareTerms.Bucket(ordsEnum.value(), docCount, null, format);
bucket.bucketOrd = mergeMap[(int) ordsEnum.ord()] = size + ordsToCollect.add(ordsEnum.value());
buckets.add(bucket);
keepCount++;
@polyfractal (Contributor) commented:

Should we just change this to a boolean flag? hasDeletions or whatever?

@nik9000 (Member, author) replied:

I think we need to perform the merge if we don't keep all the buckets. We can remove buckets for two reasons now!

  1. The key is above the threshold.
  2. The owningBucketOrd isn't selected.

This counter will catch both ways. I couldn't come up with a cleaner way to do it.
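To make the counter's role concrete, here is a minimal sketch of the merge-map idea from this hunk: surviving bucket ordinals get dense new ordinals, pruned ones are marked -1, and comparing the kept count to the total tells you whether a merge is needed at all. Names (`keep`, `next`) are mine, not the production code's.

```java
import java.util.Arrays;

// Sketch only: renumber surviving bucket ordinals densely, mark pruned ones -1.
public class MergeMapSketch {
    /**
     * @param totalOrds number of bucket ordinals that were collected
     * @param keep      keep[ord] is true when that bucket survives pruning
     * @return mergeMap where mergeMap[ord] is the new dense ordinal, or -1 if pruned
     */
    public static long[] buildMergeMap(int totalOrds, boolean[] keep) {
        long[] mergeMap = new long[totalOrds];
        Arrays.fill(mergeMap, -1);
        long next = 0;                  // plays the role of the discussed "size" counter
        for (int ord = 0; ord < totalOrds; ord++) {
            if (keep[ord]) {
                mergeMap[ord] = next++; // dense new ordinal for each kept bucket
            }
        }
        return mergeMap;
    }
}
```

A bucket can be pruned either because its key crossed the rarity threshold or because its owning bucket ordinal wasn't requested; both cases leave -1 in the map, so one kept-versus-total comparison covers both.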

// need to take care of dups
for (int i = 0; i < valuesCount; ++i) {
BytesRef bytes = values.nextValue();
if (filter != null && !filter.accept(bytes)) {
@polyfractal (Contributor) commented:

Nit: !filter.accept() :)

(Also I realize the irony since the original code had that and it was my fault :) )

@nik9000 (Member, author) replied:

👍
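For illustration, the "take care of dups" pattern in this hunk can be sketched like the toy version below, assuming (as doc values guarantee) that a document's values arrive in sorted order so duplicates are adjacent. `accept` stands in for the include/exclude filter and all names are mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch: collect each distinct accepted value once per document,
// relying on sorted input so duplicates are adjacent.
public class DedupSketch {
    public static List<String> collectDistinct(List<String> sortedValues,
                                               Predicate<String> accept) {
        List<String> collected = new ArrayList<>();
        String previous = null;
        for (String value : sortedValues) {
            if (accept != null && accept.test(value) == false) {
                continue;                       // mirrors the `!filter.accept(bytes)` guard
            }
            if (previous == null || previous.equals(value) == false) {
                collected.add(value);           // first occurrence of this value
                previous = value;
            }
        }
        return collected;
    }
}
```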

// Make a note when one of the ords has been deleted
deletionCount += 1;
filter.add(oldKey);
public InternalAggregation[] buildAggregations(long[] owningBucketOrds) throws IOException {
@polyfractal (Contributor) commented:

General comment about this method: we have a lot of "ords" being referenced and it's hard to keep track of which ord is which. E.g. we have the bucket ordinals that our parent is requesting we build, and then we have the bucket ordinals from each of those instances that we are collecting into buckets.

Not sure how, but if we could find a way to rename the variables to help identify or disambiguate them, I think it would help a bunch.
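One hedged sketch of the renaming idea, using hypothetical names: `owningOrd` for the ordinal the parent asks us to build results for, and `collectedOrd` for an ordinal in this aggregator's own bucket store. The toy map below is a stand-in for the production bucket-ords structure, not the real class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: keep the two ordinal spaces apart with distinct names.
public class OrdSpaces {
    private final Map<String, Long> ords = new HashMap<>();       // (owningOrd, value) -> collectedOrd
    private final Map<Long, List<Long>> collectedByOwner = new HashMap<>();
    private long nextCollectedOrd = 0;

    /** Collect `value` under `owningOrd`, returning its collectedOrd. */
    public long add(long owningOrd, long value) {
        String key = owningOrd + "/" + value;
        Long existing = ords.get(key);
        if (existing != null) {
            return existing;                    // already collected for this owner
        }
        long collectedOrd = nextCollectedOrd++;
        ords.put(key, collectedOrd);
        collectedByOwner.computeIfAbsent(owningOrd, k -> new ArrayList<>()).add(collectedOrd);
        return collectedOrd;
    }

    /** The collected ords belonging to one owning ord, in insertion order. */
    public List<Long> collectedOrds(long owningOrd) {
        return collectedByOwner.getOrDefault(owningOrd, List.of());
    }
}
```

With names like these, a loop over `owningBucketOrds` reads unambiguously as "for each ord the parent requested, walk the collected ords we stored under it."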

@nik9000 (Member, author) commented Jun 12, 2020

run elasticsearch-ci/default-distro

@nik9000 (Member, author) commented Jun 12, 2020

run elasticsearch-ci/1

@nik9000 (Member, author) left a comment

I'll see about cleaning up the "ords ords ords ords" stuff too.


@nik9000 changed the title from "Same memory when rare_terms is not on top" to "Save memory when rare_terms is not on top" on Jun 12, 2020
@nik9000 nik9000 merged commit 933565d into elastic:master Jun 12, 2020
@nik9000 (Member, author) commented Jun 12, 2020

Thanks @polyfractal !

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jun 12, 2020
nik9000 added a commit that referenced this pull request Jun 12, 2020
Labels: :Analytics/Aggregations, >enhancement, Team:Analytics, v7.9.0, v8.0.0-alpha1
4 participants