Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Collector.setWeight to improve aggregation performance (for special cases) #10954

Closed
msfroh opened this issue Oct 27, 2023 · 2 comments · Fixed by #11643
Closed

Use Collector.setWeight to improve aggregation performance (for special cases) #10954

msfroh opened this issue Oct 27, 2023 · 2 comments · Fixed by #11643
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance v2.13.0 Issues and PRs related to version 2.13.0 v3.0.0 Issues and PRs related to version 3.0.0

Comments

@msfroh
Copy link
Collaborator

msfroh commented Oct 27, 2023

Lucene added a new setWeight method to the Collector interface a while back (see https://issues.apache.org/jira/browse/LUCENE-10620), specifically to give collectors access to the Weight.count() method.

Weight.count() only has a few cases where it returns things other than -1 (the value meaning "I can't give you a cheap count"), but the cases where it does return are pretty useful -- mostly "match all" or "match none", but for a single term query will return "I match exactly this many", if there are no deletions in the current segment (since it just reads the term's doc freq).

I believe this can be useful to short-circuit some aggregation logic, since aggregations all extend Collector.

These are the special cases that I've been able to think of where the weight.count(leafReaderContext) could hint at smarter computation of aggregations:

  1. If the top-level query matches nothing int the current segment (i.e. weight.count(leafReaderContext) == 0), then count for every bucket is 0. (If the min count is greater than 0, then you don't need to compute any buckets for this segment.)
  2. If the top-level query matches everything in the current segment (i.e. weight.count(leafReaderContext) == leafReaderContext.reader().maxDoc()), then the count of hits in a bucket (from the current segment) is determined entirely by the count of the bucket, which may be cheap to compute (e.g. doc freq for a terms aggregation, maybe read count from the BKD tree for a range aggregation).
  3. If the top-level query has some other positive count, but a bucket matches everything in the current segment (e.g. the documents in the current segment are all from the same day and we're computing a daily date histogram), then the bucket count is weight.count(leafReaderContext).

I didn't give it a lot of thought, so there might be some more that I'm missing.

@msfroh msfroh added enhancement Enhancement or improvement to existing feature or request untriaged Search:Performance and removed untriaged labels Oct 27, 2023
@getsaurabh02 getsaurabh02 moved this from Todo to Now (This Quarter) in Performance Roadmap Oct 27, 2023
@msfroh
Copy link
Collaborator Author

msfroh commented Oct 27, 2023

I think I would probably start with TermsAggregator (and its subclasses) to try to leverage this for cases where the top-level query matches all documents.

@getsaurabh02
Copy link
Member

I like the idea of TermsAggregator given it can benefit for String Values (Global Ordinals), Map String and Numeric term aggregation cases. In scenarios, where the top-level query matches everything in the current segment, I think the constraints holds that are no deletions in the current segment, to be able to short circuit (such as bucket count) ?

Should we try to instrument for the flow and run it for a time series data set (and benchmark)?
cc: @sandeshkr419

@getsaurabh02 getsaurabh02 moved this from Now (This Quarter) to In Progress in Performance Roadmap Dec 1, 2023
@harshavamsi harshavamsi moved this from In Progress to In-Review in Performance Roadmap Mar 7, 2024
@github-project-automation github-project-automation bot moved this from In-Review to Done in Performance Roadmap Mar 12, 2024
@reta reta added v3.0.0 Issues and PRs related to version 3.0.0 v2.13.0 Issues and PRs related to version 2.13.0 labels Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance v2.13.0 Issues and PRs related to version 2.13.0 v3.0.0 Issues and PRs related to version 3.0.0
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants