
Wildcard field optimised for wildcard queries #49993

Merged: 32 commits merged into elastic:master on Mar 16, 2020

Conversation

markharwood (Contributor)

First cut at the wildcard field type.
Closes #48852

markharwood added the WIP and :Search Foundations/Mapping (index mappings, including merging and defining field types) labels on Dec 9, 2019
elasticmachine (Collaborator)

Pinging @elastic/es-search (:Search/Mapping)

markharwood force-pushed the fix/48852 branch 2 times, most recently from 9b297fd to 452987a on December 10, 2019
markharwood (Contributor, Author)

This PR is currently missing support for arrays. Should it add that?
Presumably wildcard expressions are expected to match within an array element (as opposed to span/interval queries on text arrays, which can "run over" array element boundaries).

jpountz (Contributor) left a comment:

+1 to support arrays and require wildcards to match within array elements, like for keyword fields. I left some comments; I'm also curious whether you've had a chance to check the space overhead compared to a keyword field?

```java
fields.add(field);
Field dvField = new BinaryDocValuesField(fieldType().name(), new BytesRef(value));
fields.add(dvField);
if (fieldType().omitNorms()) {
```
Review comment (jpountz): Note that this is what we would do for a field that doesn't have doc values, like text. This field requires doc values anyway, so we can skip this if statement entirely and create exists queries from doc values.
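A minimal sketch of that suggestion, assuming Lucene 8.x (where `DocValuesFieldExistsQuery` matches every doc that has a doc value for a field):

```java
import org.apache.lucene.search.DocValuesFieldExistsQuery;
import org.apache.lucene.search.Query;

// Every indexed value also gets a doc value (see dvField above), so an
// exists query can simply match any doc with a doc value for this field:
Query existsQuery = new DocValuesFieldExistsQuery(fieldType().name());
```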

```java
public class WildcardOnDvQuery extends Query {

    private String field;
    private String wildcardPattern;
```
Review comment (jpountz): Make them private?

Follow-up (jpountz): Sorry, I meant final.


```java
@Override
public boolean matches() throws IOException {
    if (values.advanceExact(approximation.docID())) {
```
Review comment (jpountz): Other doc values queries instead use values as the approximation, which removes the need to call advanceExact here.
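A minimal sketch of that pattern, assuming Lucene 8.x and a `ByteRunAutomaton bytesMatcher` compiled from the wildcard pattern (the surrounding `Weight`/`Scorer` plumbing is elided):

```java
SortedSetDocValues values = DocValues.getSortedSet(context.reader(), field);
// The doc values iterator itself is the approximation: it only visits docs
// that have a value for the field.
TwoPhaseIterator twoPhase = new TwoPhaseIterator(values) {
    @Override
    public boolean matches() throws IOException {
        // The approximation has already positioned values on the current
        // doc, so no advanceExact call is needed here.
        for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
            BytesRef value = values.lookupOrd(ord);
            if (bytesMatcher.run(value.bytes, value.offset, value.length)) {
                return true;
            }
        }
        return false;
    }

    @Override
    public float matchCost() {
        return 1000f; // crude estimate of the per-doc automaton cost
    }
};
return new ConstantScoreScorer(this, score(), scoreMode, twoPhase);
```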

```java
long ord = values.nextOrd();
while (ord != SortedSetDocValues.NO_MORE_ORDS) {
    BytesRef value = values.lookupOrd(ord);
    if (bytesMatcher.run(value.bytes, 0, value.length)) {
```
Review comment (jpountz): Please use value.offset, even if it is always 0 in practice with the current codec.

markharwood (Contributor, Author)

Thanks for the review, @jpountz.
Just a couple of things left to do on this, I think:

  1. docs
  2. remove the YAML test? (I had that before I got the integ test going).

jpountz (Contributor) commented Dec 13, 2019

Actually, our general recommendation is to avoid integration tests. In this case, we'd do extensive testing of whether it matches wildcard queries correctly in unit tests, and have a YAML test mostly to ensure that everything is wired together correctly. And no EsIntegTestCase.
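A rough illustration of the kind of unit test meant here, exercising the PR's `WildcardOnDvQuery` directly at the Lucene level (the constructor signature and the test scaffolding are assumptions for this sketch, not the PR's actual tests):

```java
public class WildcardOnDvQueryTests extends LuceneTestCase {

    public void testWildcardMatching() throws IOException {
        try (Directory dir = newDirectory();
             RandomIndexWriter iw = new RandomIndexWriter(random(), dir)) {
            Document doc = new Document();
            // Store the full original value as a doc value, as the field mapper does
            doc.add(new BinaryDocValuesField("field", new BytesRef("internal queue is full")));
            iw.addDocument(doc);
            try (IndexReader reader = iw.getReader()) {
                IndexSearcher searcher = new IndexSearcher(reader);
                assertEquals(1, searcher.count(new WildcardOnDvQuery("field", "*queue is*")));
                assertEquals(0, searcher.count(new WildcardOnDvQuery("field", "*no such value*")));
            }
        }
    }
}
```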

jpountz (Contributor) commented Dec 13, 2019

I'm also curious to get some more performance/disk-overhead data before doing a more in-depth review, in order to know whether we need to go back to the drawing board or whether we want to move forward with this approach.

jpountz (Contributor) commented Dec 16, 2019

Thinking more about the performance comparison, I suspect we will want to check at least two scenarios:

  1. wildcard query with no filter
  2. wildcard query with a very selective filter (possibly an ids filter on a couple ids to test the worst-case scenario)

I'm expecting the wildcard_keyword field to be sometimes faster and sometimes slower for (1) (depending on the size of the terms dict and whether there is a leading wildcard), but consistently faster for (2).
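For concreteness, the two scenarios might look like this with the Java client's query builders (the field name and ids are placeholders):

```java
// Scenario 1: wildcard query with no filter
QueryBuilder scenario1 = QueryBuilders.wildcardQuery("message", "*queue is full*");

// Scenario 2: the same wildcard behind a very selective ids filter,
// the worst case where only a couple of docs reach the wildcard clause
QueryBuilder scenario2 = QueryBuilders.boolQuery()
        .filter(QueryBuilders.idsQuery().addIds("1", "2"))
        .filter(QueryBuilders.wildcardQuery("message", "*queue is full*"));
```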

markharwood (Contributor, Author) commented Dec 18, 2019

Some preliminary results from the current wildcard-on-DV approach.
I gave up trying to index 5-char ngrams: with a 1G heap it was causing excessive GC and timeouts, likely due to the volume of unique terms being generated.
I opted for testing with 3-char ngrams and found favourable results scanning elasticsearch log files for key event strings. The resulting index size is 650mb:

| pattern | wildcard time (ms) | keyword time (ms) | wildcard matches | keyword matches |
|---|---|---|---|---|
| `*above the warn threshold*` | 2 | 359 | 192 | 192 |
| `*nternal queue is full*` | 1 | 353 | 0 | 0 |
| `*timed out waiting for all nodes to process published state*` | 2 | 345 | 0 | 0 |
| `*now throttling indexing*` | 1 | 337 | 0 | 0 |
| `*cluster state update task*` | 1 | 338 | 3 | 3 |
| `*above the warn threshold*` | 2 | 337 | 192 | 192 |
| `*done applying updated cluster_state*` | 1 | 344 | 0 | 0 |
| `*Limit of total fields*` | 73 | 335 | 7886 | 7886 |
| `*No space left on device*` | 1 | 336 | 0 | 0 |
| `*Failed to parse query*` | 1 | 342 | 0 | 0 |
| `*] stopping ...*` | 1 | 336 | 25 | 25 |
| `*] starting ...*` | 1 | 335 | 42 | 42 |
| `*:\\elastic\\elasticsearch-6.6.1_prod\\config\\roles.yml*` | 2 | 355 | 6 | 6 |

Most of the patterns match only rarely, but when a search matches many docs (e.g. the 7,886-match query) we can see an increase in the cost of the wildcard approach.
This was done without a selective filter clause (my patterns are generally selective enough).

markharwood (Contributor, Author)

I wonder if we can eliminate some of the costs involved in storing/scanning machine-generated strings (e.g. stacktraces) by identifying common subsequences at index time.

The significant text aggregation has some logic to do this in the DeDuplicatingTokenFilter: sequences of 6 words that are repeated in a stream of docs are identified using a low-cost lookup. The significant text aggregation removes whole sections of text that are seen as duplicates, but we could potentially use the same structure to store and search more compact representations of near-duplicate strings.

markharwood (Contributor, Author)

As for disk usage: my test log index was a total of 366mb for a keyword-only mapping vs 398mb for a wildcard_keyword-only mapping (using 3-grams + doc values).

markharwood (Contributor, Author)

Possible issue with position-based matching when we have array fields (at least in terms of behaviour compared to wildcard search on keyword fields).
I got a false positive for the query a*aba on the doc ["ababaaa","baaabbabbbaabbaba"].
We could add a large position increment into the index between array elements, but the * in the query means an unlimited gap, so that would not work.
This could be tricky, so perhaps we should just accept that these implementations diverge in this behaviour?
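To make the false positive concrete: neither element matches a*aba on its own (the first doesn't end in aba, the second doesn't start with a), but the concatenated stream that position-based matching effectively sees does. A quick hypothetical check:

```java
Automaton automaton = WildcardQuery.toAutomaton(new Term("f", "a*aba"));
ByteRunAutomaton matcher = new ByteRunAutomaton(automaton);

for (String s : new String[] { "ababaaa", "baaabbabbbaabbaba", "ababaaa" + "baaabbabbbaabbaba" }) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    // Prints false, false, true: only the concatenated form matches
    System.out.println(s + " -> " + matcher.run(bytes, 0, bytes.length));
}
```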

jpountz (Contributor) commented Dec 19, 2019

Hmm, this is a good point. I believe using positions would also have the same downside as sloppy phrase queries have about missing matches, since they don't (and maybe can't, realistically, because of the combinatorial explosion) explore all matches.

jpountz (Contributor) commented Dec 19, 2019

Unless @jimczi feels otherwise, I guess positions are not really an option and we should just explore the path of indexing ngrams with docs only (no freqs or positions) and using doc values to verify?
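A sketch of what that path might look like, with a hypothetical helper that turns the concrete character runs of a pattern into a docs-only ngram conjunction (assuming 3-grams and ignoring fragments shorter than the ngram size):

```java
static Query ngramApproximation(String field, String pattern) {
    BooleanQuery.Builder bq = new BooleanQuery.Builder();
    // Concrete character runs between the wildcard operators
    for (String fragment : pattern.split("[*?]+")) {
        for (int i = 0; i + 3 <= fragment.length(); i++) {
            String ngram = fragment.substring(i, i + 3);
            bq.add(new TermQuery(new Term(field, ngram)), BooleanClause.Occur.FILTER);
        }
    }
    // This matches a superset of the true results; each candidate still has
    // to be verified by running the wildcard automaton over its doc value.
    return bq.build();
}
```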

markharwood (Contributor, Author) commented Dec 19, 2019

I'm hoping to get some numbers tomorrow on position-based search speeds, so we can see what we are throwing away if we decide not to go down this route.

jpountz (Contributor) commented Dec 19, 2019

Ah, if you are close to having some numbers that would be great indeed!

markharwood (Contributor, Author)

I've not got 100% compatibility with keyword+wildcard matching behaviour, but it's close enough for benchmarks, I expect.

jimczi (Contributor) commented Dec 19, 2019

> I believe using positions would also have the same downside as sloppy phrase queries have about missing matches, since they don't (and maybe can't, realistically, because of the combinatorial explosion) explore all matches.

We only need an ordered query, so I am not sure about the combinatorial explosion? Phrase queries with slop allow some reordering, so that's out of scope for this field. An intervals query would work well, I think, but we need to fix the case where there are duplicates in the query first:
apache/lucene-solr#1097
My take on this is that we should compare with a position-based approach, since doc values would only be useful for validation and they add to the total cost, especially in terms of index size.

> Possible issue with position-based matching when we have array fields (at least in terms of behaviour compared to wildcard search on keyword fields).

That's a good point indeed, and maybe something that we can restrict or explain in the documentation, but imo this shouldn't eliminate this option completely.

> I'm hoping to get some numbers tomorrow on position-based search speeds, so we can see what we are throwing away if we decide not to go down this route.

Speed and index size: that's my main motivation to test this, since doc values seem overkill if we don't want to restrict this to small fields only.

jpountz (Contributor) commented Dec 19, 2019

Ah, you are right, I got confused. The fact that terms within the phrase must be ordered avoids the combinatorial explosion issue that sloppy phrase queries have indeed.

jimczi (Contributor) commented Dec 20, 2019

I got confused too ;) The issue in apache/lucene-solr#1097 does not affect strictly ordered queries, so using Intervals to validate our ngrams with positions should work.
So a query like abcdef*xyz could be rewritten into Intervals.ordered(Intervals.phrase(abc, bcd, cde, def), xyz) and simplified into Intervals.ordered(Intervals.phrase(abc, Intervals.extend(def, 2)), xyz).
I am not sure if it's a problem or not, but we'd need to handle wildcard intervals too, since a query like a*b contains blocks that are smaller than the ngram size. We limit the expansion in these queries to the maximum boolean clause count, so we might not be able to find a match for these queries using positions only. However, we're leaning towards smaller ngrams (3), so this should only be a problem for blocks that have a single character before (or in between) a wildcard.
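A sketch of that rewrite with Lucene's Intervals API (Lucene 8.x; the extend arguments reflect my reading of the simplification, not exact code from this thread):

```java
// abcdef*xyz with 3-grams: anchor on "abc", require "def" to start exactly
// 3 positions later (extend its interval 2 positions to the left), then
// allow an arbitrary gap before "xyz".
IntervalsSource source = Intervals.ordered(
        Intervals.phrase(
                Intervals.term("abc"),
                Intervals.extend(Intervals.term("def"), 2, 0)),
        Intervals.term("xyz"));

Query query = new IntervalQuery("my_field", source);
```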

markharwood (Contributor, Author) commented Dec 20, 2019

Index size comparisons

All are force-merged copies of the same elasticsearch.log strings.

  • wildcardtest_kw: plain keyword
  • wildcardtest_pos: 3-gram, indexed positions, no DV
  • wildcardtest_dv: 3-gram, no positions, DV
  • wildcardtest_txt: default text field (does not support EQL-style wildcard searches, so used for size comparison only)

| index | docs.count | store.size |
|---|---|---|
| wildcardtest_kw | 1768155 | 356.8mb |
| wildcardtest_pos | 1768155 | 963mb |
| wildcardtest_dv | 1768155 | 387.9mb |
| wildcardtest_txt | 1768155 | 182.4mb |

jimczi (Contributor) commented Mar 13, 2020

> Is the issue that it should be formatted as 1 line instead of 2? I have that change in the latest code.

No, that's about not accepting similarity and index_options in the builder: here and here.

jimczi (Contributor) left a comment:

LGTM, thanks @markharwood for all the iterations on this!

markharwood merged commit a2a4756 into elastic:master on Mar 16, 2020
markharwood added a commit to markharwood/elasticsearch that referenced this pull request Mar 16, 2020
Indexes values using size 3 ngrams and also stores the full original as a binary doc value.
Wildcard queries operate by using a cheap approximation query on the ngram field followed up by a more expensive verification query using an automaton on the binary doc values.  Also supports aggregations and sorting.
markharwood added a commit that referenced this pull request Mar 16, 2020
* New wildcard field optimised for wildcard queries (#49993)

Indexes values using size 3 ngrams and also stores the full original as a binary doc value.
Wildcard queries operate by using a cheap approximation query on the ngram field followed up by a more expensive verification query using an automaton on the binary doc values.  Also supports aggregations and sorting.
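As a rough illustration of the verification phase the commit message describes (assuming Lucene 8.x APIs; the pattern and field name are placeholders):

```java
// Compile the wildcard pattern into a byte-level run automaton once
Automaton automaton = WildcardQuery.toAutomaton(new Term("my_wildcard_field", "*queue is full*"));
ByteRunAutomaton verifier = new ByteRunAutomaton(automaton);

// For each candidate surfaced by the cheap ngram approximation query,
// verify against the full original value kept in binary doc values
BinaryDocValues dv = DocValues.getBinary(leafReader, "my_wildcard_field");
if (dv.advanceExact(candidateDocId)) {
    BytesRef value = dv.binaryValue();
    boolean verified = verifier.run(value.bytes, value.offset, value.length);
}
```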
jimczi added the v7.9.0 label on Jun 15, 2020
ebeahan mentioned this pull request on Jul 20, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jul 29, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jul 31, 2020
github-actions bot pushed a commit to elastic/elasticsearch-net that referenced this pull request Jul 31, 2020
github-actions bot pushed a commit to elastic/elasticsearch-net that referenced this pull request Jul 31, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jul 31, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jul 31, 2020
astefan mentioned this pull request on Sep 9, 2020
Labels: >feature, release highlight, :Search Foundations/Mapping, v7.9.0, v8.0.0-alpha1

Successfully merging this pull request may close these issues: Faster wildcard search (#48852)