Save memory when rare_terms is not on top #57948
Conversation
This uses the optimization that we started making in elastic#55873 for `rare_terms` to save a bit of memory when that aggregation is not on the top level.
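As a rough illustration of the idea (all names here are hypothetical, not the actual Elasticsearch API): instead of allocating a separate table per owning bucket, a single map keyed on the (owning bucket ordinal, value) pair hands out dense bucket ordinals, so memory is only spent on combinations that actually occur.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of long-keyed bucket ordinal collection. One shared map
// assigns dense ordinals to (owningBucketOrd, value) pairs on demand.
public class BucketOrdsSketch {
    private final Map<Long, Map<Long, Long>> ords = new HashMap<>();
    private long nextOrd = 0;

    /** Returns the dense ordinal for (owningBucketOrd, value), allocating one if new. */
    public long add(long owningBucketOrd, long value) {
        return ords.computeIfAbsent(owningBucketOrd, k -> new HashMap<>())
                   .computeIfAbsent(value, k -> nextOrd++);
    }

    /** Total number of ordinals allocated so far. */
    public long size() {
        return nextOrd;
    }

    public static void main(String[] args) {
        BucketOrdsSketch bucketOrds = new BucketOrdsSketch();
        assert bucketOrds.add(0, 42) == 0; // new combination gets ord 0
        assert bucketOrds.add(0, 42) == 0; // same combination, same ord
        assert bucketOrds.add(1, 42) == 1; // new owning bucket, new ord
        assert bucketOrds.size() == 2;
    }
}
```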
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)
Left a few comments, mostly around naming :) I only left them on the Long agg, but they apply equally to String.
Otherwise LGTM :)
```java
@@ -39,13 +39,13 @@
 * An approximate set membership datastructure that scales as more unique values are inserted.
 * Can definitively say if a member does not exist (no false negatives), but may say an item exists
 * when it does not (has false positives). Similar in usage to a Bloom Filter.
- *
+ * <p>
```
I hate javadocs so much :( optimizing rendered readability while sacrificing IDE readability :(
Thanks for fixing this :)
```java
long keepCount = 0;
long[] mergeMap = new long[(int) bucketOrds.size()];
Arrays.fill(mergeMap, -1);
long size = 0;
```
Hmm, this is a bit confusingly named I think? Maybe `currentOffset` or something? Not sure, but `size` feels a bit confusing.
```java
LongRareTerms.Bucket bucket = new LongRareTerms.Bucket(ordsEnum.value(), docCount, null, format);
bucket.bucketOrd = mergeMap[(int) ordsEnum.ord()] = size + ordsToCollect.add(ordsEnum.value());
buckets.add(bucket);
keepCount++;
```
Should we just change this to a boolean flag? `hasDeletions` or whatever?
I think we need to perform the merge if we don't keep all the buckets. We can remove buckets for two reasons now!
- The key is above the threshold.
- The `owningBucketOrd` isn't selected.

This counter will catch both ways. I couldn't come up with a cleaner way to do it.
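The shape of the pruning logic being discussed can be sketched as follows (a self-contained illustration with hypothetical names, not the actual aggregator code): walk the collected ordinals, keep the surviving ones while recording an old-ord-to-new-ord merge map, and compare `keepCount` against the total afterward to decide whether a merge is needed at all.

```java
import java.util.Arrays;

// Sketch of building a merge map while counting kept buckets. Ords that are
// pruned (for either removal reason) stay mapped to -1; kept ords get the
// next dense ordinal. A merge is needed only when keepCount < total.
public class PruneSketch {
    /** Maps each old ordinal to a dense new ordinal, or -1 if pruned. */
    static long[] buildMergeMap(long[] docCounts, long maxDocCount) {
        long[] mergeMap = new long[docCounts.length];
        Arrays.fill(mergeMap, -1);
        long keepCount = 0;
        for (int ord = 0; ord < docCounts.length; ord++) {
            if (docCounts[ord] <= maxDocCount) {
                mergeMap[ord] = keepCount++; // survivor gets next dense ord
            }
        }
        return mergeMap;
    }

    public static void main(String[] args) {
        // doc counts per ord; only counts <= 1 are "rare" enough to keep
        long[] mergeMap = buildMergeMap(new long[] {1, 5, 1, 3}, 1);
        assert Arrays.equals(mergeMap, new long[] {0, -1, 1, -1});
    }
}
```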
```java
// need to take care of dups
for (int i = 0; i < valuesCount; ++i) {
    BytesRef bytes = values.nextValue();
    if (filter != null && !filter.accept(bytes)) {
```
Nit: `!filter.accept()` :)
(Also I realize the irony since the original code had that and it was my fault :) )
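The collect loop being reviewed relies on per-document values arriving sorted, so duplicates are adjacent and can be skipped by comparing against the previously collected value. A hedged, self-contained sketch of that pattern (all names illustrative, using `long` values in place of `BytesRef`):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a dedup-while-collecting loop: sorted input means duplicates
// are adjacent, and an optional filter can reject values outright.
public class DedupSketch {
    interface Filter { boolean accept(long value); }

    static List<Long> collect(long[] sortedDocValues, Filter filter) {
        List<Long> collected = new ArrayList<>();
        long previous = Long.MIN_VALUE;
        boolean first = true;
        for (long value : sortedDocValues) {
            if (filter != null && filter.accept(value) == false) {
                continue; // excluded by the include/exclude filter
            }
            if (first || value != previous) { // skip adjacent duplicates
                collected.add(value);
                previous = value;
                first = false;
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        List<Long> got = collect(new long[] {1, 1, 2, 3, 3}, v -> v != 2);
        assert got.equals(List.of(1L, 3L));
    }
}
```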
👍
```java
// Make a note when one of the ords has been deleted
deletionCount += 1;
filter.add(oldKey);
```

```java
public InternalAggregation[] buildAggregations(long[] owningBucketOrds) throws IOException {
```
General comment about this method: we have a lot of "ords" being referenced and it's hard to keep track of which ord is which. E.g. we have the bucket ordinals that our parent is requesting we build, and then we have the bucket ordinals from each of those instances that we are collecting into buckets.
Not sure how, but if we could find a way to rename the variables to help identify or disambiguate I think it would help a bunch.
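One way the renaming could go (purely illustrative, hypothetical names): keep "owning" in the name of any ordinal handed down by the parent, and use a distinct name for ordinals this aggregator collected underneath each one, so the nesting is readable at a glance.

```java
// Sketch of keeping the two ordinal spaces straight in buildAggregations:
// owningBucketOrd = ordinal the parent asks us to build results for;
// collectedBucketOrd = ordinal this aggregator collected under it.
public class OrdsNamingSketch {
    static long countResultBuckets(long[] owningBucketOrds, long[][] collectedOrdsByOwning) {
        long total = 0;
        for (long owningBucketOrd : owningBucketOrds) {
            // each owning ordinal has its own set of collected bucket ordinals
            for (long collectedBucketOrd : collectedOrdsByOwning[(int) owningBucketOrd]) {
                total++; // one result bucket per collected ordinal
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long[][] collected = { {0, 1}, {2} }; // ords grouped by owning bucket
        assert countResultBuckets(new long[] {0, 1}, collected) == 3;
    }
}
```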
run elasticsearch-ci/default-distro

run elasticsearch-ci/1
I'll see about cleaning up the "ords ords ords ords" stuff too.
Thanks @polyfractal !