Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Aggregation include/exclude should support faster filtering on prefixes #14368

Closed
msfroh opened this issue Jun 14, 2024 · 2 comments · Fixed by #14371
Closed
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request Search:Aggregations v2.17.0

Comments

@msfroh
Copy link
Collaborator

msfroh commented Jun 14, 2024

Is your feature request related to a problem? Please describe

I've been working with an ecommerce user of OpenSearch, where they've implemented something like Lucene's hierarchical facets by tagging products with category:label pairs (like color:red). These are collected using terms aggregations.

To avoid collecting facet labels that are irrelevant to the current search, they have some mechanism to identify the relevant facet categories for the query. They've been filtering these using a regular expression in the include parameter for their terms aggregation, like "include": "(color|size|brand|material|machine\\ washable|sleeve\\ length|...):.+", in order to filter the category:label pairs based on a specific set of categories. Overall, prefix filtering on terms aggregations seems like a fairly reasonable thing to want to do.

Unfortunately, this really slows down search requests, as the IncludeExclude class tries to step through all possible category:label values (in the global ordinals) that match the expression. The commit from 2015 that added the current automaton-based behavior (which was a significant speedup from what came before) mentions this problem in the commit message:

Today we check every regular expression eagerly against every possible term.
This can be very slow if you have lots of unique terms, and even the bottleneck
if your query is selective.

Indeed, there's a TODO in place calling out that prefix matching should be a special case:

// TODO: specialize based on compiled.type: for ALL and prefixes (sinkState >= 0 ) we can avoid i/o and just set bits.

https://github.com/msfroh/OpenSearch/blob/cf2c31fffe844f78f17cf1c2a780198b9b6258d4/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/IncludeExclude.java#L344

Describe the solution you'd like

I would like to specialize the IncludeExclude.AutomatonBackedOrdinalsFilter to handle prefixes better.

Ideally, I would like to make it fully transparent to the user -- essentially, I'd like to address the TODO that I listed above, where we simply handle prefixes as a special case of regexp include/exclude.

In order to do that, I need to learn more about Lucene's automaton matching, which is something that I would like to wrap my head around anyway. (I learned a little bit as a result of #13461, but that only scratched the surface. I want to know more.)

Related component

Search:Aggregations

Describe alternatives you've considered

If the Lucene automaton rabbit hole turns out to be too dark and scary, another option could be to add include_prefixes and exclude_prefixes parameters that we can parse from the IncludeExclude class.

Once we have that, we could implement logic similar to

public LongBitSet acceptedGlobalOrdinals(SortedSetDocValues globalOrdinals) throws IOException {
LongBitSet acceptedGlobalOrdinals = new LongBitSet(globalOrdinals.getValueCount());
TermsEnum globalTermsEnum;
Terms globalTerms = new DocValuesTerms(globalOrdinals);
// TODO: specialize based on compiled.type: for ALL and prefixes (sinkState >= 0 ) we can avoid i/o and just set bits.
globalTermsEnum = compiled.getTermsEnum(globalTerms);
for (BytesRef term = globalTermsEnum.next(); term != null; term = globalTermsEnum.next()) {
acceptedGlobalOrdinals.set(globalTermsEnum.ord());
}
return acceptedGlobalOrdinals;
}

The new logic would be something like:

static class PrefixBackedOrdinalsFilter extends OrdinalsFilter {

  private final SortedSet<BytesRef> include, exclude;

  private PrefixBackedOrdinalsFilter(SortedSet<BytesRef> include, SortedSet<BytesRef> exclude) {
    this.include = include;
    this.exclude = exclude;
  }

  private static BytesRef nextBytesRef(BytesRef bytesRef) {
    BytesRef next = BytesRef.deepCopyOf(bytesRef);
    int pos = next.bytes.length - 1;
    while (pos >= 0 && next.bytes[pos] == 255) {
      next.bytes[pos] = 0;
      pos--;
    }
    if (pos >= 0) {
      next.bytes[pos]++;
    } else {
      // Every byte in our prefix had value 255. We must match all subsequent ordinals.
      return null;
    }
    return next;
  }

  @Override
  public LongBitSet acceptedGlobalOrdinals(SortedSetDocValues globalOrdinals) throws IOException {
    LongBitSet accept = new LongBitSet(globalOrdinals.getValueCount());
    for (BytesRef inc : includes) {
      int startOrd = globalOrdinals.lookupTerm(inc);
      if (startOrd < 0) {
        startOrd = -1 - startOrd;
      }
      if (startOrd >= accept.length()) {
        continue;
      }
      BytesRef next = nextBytesRef(inc);
      if (next == null) {
        accept.set(startOrd, accept.length());
      } else {
        endOrd = globalOrdinals.lookupTerm(next);
        if (endOrd < 0) {
          endOrd = -1 - endOrd;
        }
        if (startOrd < endOrd) {
          accept.set(startOrd, endOrd);
        }
      }
    }
    for (BytesRef exc : excludes) {
      // Same logic as above, but we use `accept.clear(...)` to unset bits.
    }
  }
}

Additional context

No response

@msfroh msfroh added enhancement Enhancement or improvement to existing feature or request untriaged labels Jun 14, 2024
@msfroh
Copy link
Collaborator Author

msfroh commented Jun 15, 2024

Created a draft PR that is still completely untested: #14371

@harshavamsi
Copy link
Contributor

Lucene's automaton matching

@msfroh is there a particular class/package I could look at for reference?

@getsaurabh02 getsaurabh02 added the v2.16.0 Issues and PRs related to version 2.16.0 label Jun 24, 2024
@mch2 mch2 removed the untriaged label Jun 26, 2024
@getsaurabh02 getsaurabh02 added v2.17.0 and removed v2.16.0 Issues and PRs related to version 2.16.0 labels Jul 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Search:Aggregations v2.17.0
Projects
Status: 2.17 (First RC 09/03, Release 09/17)
Status: Done
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants