
Wildcard field optimised for wildcard queries #49993

Merged: 32 commits merged into elastic:master on Mar 16, 2020

Conversation

markharwood (Contributor)

First cut at the wildcard field type.
Closes #48852

markharwood added the WIP and :Search Foundations/Mapping (index mappings, including merging and defining field types) labels on Dec 9, 2019
elasticmachine (Collaborator)

Pinging @elastic/es-search (:Search/Mapping)

markharwood force-pushed the fix/48852 branch 2 times, most recently from 9b297fd to 452987a on December 10, 2019
markharwood (Contributor, Author)

This PR is currently missing support for arrays. Should it add that?
Presumably wildcard expressions are expected to match within an array element (as opposed to span/interval queries on text arrays, which can "run over" array element boundaries).

jpountz (Contributor) left a comment:

+1 to support arrays and require wildcards to match within array elements, like for keyword fields. I left some comments; I'm also curious whether you've had a chance to check the space overhead compared to a keyword field?

```java
fields.add(field);
Field dvField = new BinaryDocValuesField(fieldType().name(), new BytesRef(value));
fields.add(dvField);
if (fieldType().omitNorms()) {
```
Review comment (jpountz): Note that this is what we would do for a field that doesn't have doc values, like text. This field requires doc values anyway, so we can skip this if statement entirely and create exists queries from doc values.
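A minimal sketch of that suggestion, assuming Lucene 8.x (where `DocValuesFieldExistsQuery` matches every doc that has a doc value for a field):

```java
import org.apache.lucene.search.DocValuesFieldExistsQuery;
import org.apache.lucene.search.Query;

// Every indexed value also gets a doc value (see dvField above), so an
// exists query can simply match any doc with a doc value for this field:
Query existsQuery = new DocValuesFieldExistsQuery(fieldType().name());
```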

```java
public class WildcardOnDvQuery extends Query {

    private String field;
    private String wildcardPattern;
```
Review comment (jpountz): Make them private?

Follow-up (jpountz): Sorry, I meant final.


```java
@Override
public boolean matches() throws IOException {
    if (values.advanceExact(approximation.docID())) {
```
Review comment (jpountz): Other doc values queries instead use values as the approximation, which removes the need to call advanceExact here.
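A minimal sketch of that pattern, assuming Lucene 8.x and a `ByteRunAutomaton bytesMatcher` compiled from the wildcard pattern (the surrounding `Weight`/`Scorer` plumbing is elided):

```java
SortedSetDocValues values = DocValues.getSortedSet(context.reader(), field);
// The doc values iterator itself is the approximation: it only visits docs
// that have a value for the field.
TwoPhaseIterator twoPhase = new TwoPhaseIterator(values) {
    @Override
    public boolean matches() throws IOException {
        // The approximation has already positioned values on the current
        // doc, so no advanceExact call is needed here.
        for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
            BytesRef value = values.lookupOrd(ord);
            if (bytesMatcher.run(value.bytes, value.offset, value.length)) {
                return true;
            }
        }
        return false;
    }

    @Override
    public float matchCost() {
        return 1000f; // crude estimate of the per-doc automaton cost
    }
};
return new ConstantScoreScorer(this, score(), scoreMode, twoPhase);
```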

```java
long ord = values.nextOrd();
while (ord != SortedSetDocValues.NO_MORE_ORDS) {
    BytesRef value = values.lookupOrd(ord);
    if (bytesMatcher.run(value.bytes, 0, value.length)) {
```
Review comment (jpountz): Please use value.offset, even if it is always 0 in practice with the current codec.

markharwood (Contributor, Author)

Thanks for the review, @jpountz.
Just a couple of things left to do on this, I think:

  1. docs
  2. remove the YAML test? (I had that before I got the integ test going).

jpountz (Contributor) commented Dec 13, 2019

Actually, our general recommendation is to avoid integration tests. In this case, we'd do extensive testing of whether it matches wildcard queries correctly in unit tests, and have a YAML test mostly to ensure that everything is wired together correctly. And no EsIntegTestCase.
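A rough illustration of the kind of unit test meant here, exercising the PR's `WildcardOnDvQuery` directly at the Lucene level (the constructor signature and the test scaffolding are assumptions for this sketch, not the PR's actual tests):

```java
public class WildcardOnDvQueryTests extends LuceneTestCase {

    public void testWildcardMatching() throws IOException {
        try (Directory dir = newDirectory();
             RandomIndexWriter iw = new RandomIndexWriter(random(), dir)) {
            Document doc = new Document();
            // Store the full original value as a doc value, as the field mapper does
            doc.add(new BinaryDocValuesField("field", new BytesRef("internal queue is full")));
            iw.addDocument(doc);
            try (IndexReader reader = iw.getReader()) {
                IndexSearcher searcher = new IndexSearcher(reader);
                assertEquals(1, searcher.count(new WildcardOnDvQuery("field", "*queue is*")));
                assertEquals(0, searcher.count(new WildcardOnDvQuery("field", "*no such value*")));
            }
        }
    }
}
```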

jpountz (Contributor) commented Dec 13, 2019

I'm also curious to get some more performance/disk-overhead data before doing a more in-depth review, in order to know whether we need to go back to the drawing board or whether we want to move forward with this approach.

jpountz (Contributor) commented Dec 16, 2019

Thinking more about the performance comparison, I suspect we will want to check at least two scenarios:

  1. wildcard query with no filter
  2. wildcard query with a very selective filter (possibly an ids filter on a couple ids to test the worst-case scenario)

I'm expecting the wildcard_keyword field to be sometimes faster and sometimes slower for (1) (depending on the size of the terms dict and whether there is a leading wildcard), but consistently faster for (2).
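For concreteness, the two scenarios might look like this with the Java client's query builders (the field name and ids are placeholders):

```java
// Scenario 1: wildcard query with no filter
QueryBuilder scenario1 = QueryBuilders.wildcardQuery("message", "*queue is full*");

// Scenario 2: the same wildcard behind a very selective ids filter,
// the worst case where only a couple of docs reach the wildcard clause
QueryBuilder scenario2 = QueryBuilders.boolQuery()
        .filter(QueryBuilders.idsQuery().addIds("1", "2"))
        .filter(QueryBuilders.wildcardQuery("message", "*queue is full*"));
```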

markharwood (Contributor, Author) commented Dec 18, 2019

Some preliminary results from the current wildcard-on-DV approach.
I gave up trying to index 5-char ngrams: with a 1G heap it was causing excessive GC and timeouts, likely due to the volume of unique terms being generated.
I opted for testing with 3-char ngrams and found favourable results scanning elasticsearch log files for key event strings. The resulting index size is 650mb:

| pattern | wildcard time (ms) | keyword time (ms) | wildcard matches | keyword matches |
|---|---|---|---|---|
| `*above the warn threshold*` | 2 | 359 | 192 | 192 |
| `*nternal queue is full*` | 1 | 353 | 0 | 0 |
| `*timed out waiting for all nodes to process published state*` | 2 | 345 | 0 | 0 |
| `*now throttling indexing*` | 1 | 337 | 0 | 0 |
| `*cluster state update task*` | 1 | 338 | 3 | 3 |
| `*above the warn threshold*` | 2 | 337 | 192 | 192 |
| `*done applying updated cluster_state*` | 1 | 344 | 0 | 0 |
| `*Limit of total fields*` | 73 | 335 | 7886 | 7886 |
| `*No space left on device*` | 1 | 336 | 0 | 0 |
| `*Failed to parse query*` | 1 | 342 | 0 | 0 |
| `*] stopping ...*` | 1 | 336 | 25 | 25 |
| `*] starting ...*` | 1 | 335 | 42 | 42 |
| `*:\\elastic\\elasticsearch-6.6.1_prod\\config\\roles.yml*` | 2 | 355 | 6 | 6 |

Most of the patterns match only rarely, but when a search matches many docs (e.g. the 7,886-match query) we can see an increase in the cost of the wildcard approach.
This was done without a selective filter clause (my patterns are generally selective enough).

markharwood (Contributor, Author)

I wonder if we can eliminate some of the costs involved in storing/scanning machine-generated strings (e.g. stacktraces) by identifying common subsequences at index time.

The significant text aggregation has some logic to do this in the DeDuplicatingTokenFilter: sequences of 6 words that are repeated in a stream of docs are identified using a low-cost lookup. The significant text aggregation removes whole sections of text that are seen as duplicates, but we could potentially use the same structure to store and search more compact representations of near-duplicate strings.

markharwood (Contributor, Author)

As for disk usage: my test log index was a total of 366mb for a keyword-only mapping vs 398mb for a wildcard_keyword-only mapping (using 3-grams + doc values).

markharwood (Contributor, Author)

Possible issue with position-based matching when we have array fields (at least in terms of behaviour compared to wildcard search on keyword fields).
I got a false positive for the query a*aba on the doc ["ababaaa","baaabbabbbaabbaba"].
We could add a large position increment into the index between array elements, but the * in the query means an unlimited gap, so that would not work.
This could be tricky, so perhaps we should just accept that these implementations diverge in this behaviour?
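To make the false positive concrete: neither element matches a*aba on its own (the first doesn't end in aba, the second doesn't start with a), but the concatenated stream that position-based matching effectively sees does. A quick hypothetical check:

```java
Automaton automaton = WildcardQuery.toAutomaton(new Term("f", "a*aba"));
ByteRunAutomaton matcher = new ByteRunAutomaton(automaton);

for (String s : new String[] { "ababaaa", "baaabbabbbaabbaba", "ababaaa" + "baaabbabbbaabbaba" }) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    // Prints false, false, true: only the concatenated form matches
    System.out.println(s + " -> " + matcher.run(bytes, 0, bytes.length));
}
```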

jpountz (Contributor) commented Dec 19, 2019

Hmm, this is a good point. I believe using positions would also have the same downside as sloppy phrase queries have about missing matches, since they don't (and maybe can't, realistically, because of the combinatorial explosion) explore all matches.

jpountz (Contributor) commented Dec 19, 2019

Unless @jimczi feels otherwise, I guess positions are not really an option and we should just explore the path of indexing ngrams with docs only (no freqs or positions) and using doc values to verify?
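A sketch of what that path might look like, with a hypothetical helper that turns the concrete character runs of a pattern into a docs-only ngram conjunction (assuming 3-grams and ignoring fragments shorter than the ngram size):

```java
static Query ngramApproximation(String field, String pattern) {
    BooleanQuery.Builder bq = new BooleanQuery.Builder();
    // Concrete character runs between the wildcard operators
    for (String fragment : pattern.split("[*?]+")) {
        for (int i = 0; i + 3 <= fragment.length(); i++) {
            String ngram = fragment.substring(i, i + 3);
            bq.add(new TermQuery(new Term(field, ngram)), BooleanClause.Occur.FILTER);
        }
    }
    // This matches a superset of the true results; each candidate still has
    // to be verified by running the wildcard automaton over its doc value.
    return bq.build();
}
```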

markharwood (Contributor, Author) commented Dec 19, 2019

I'm hoping to get some numbers tomorrow on position-based search speeds, so we can see what we are throwing away if we decide not to go down this route.

jpountz (Contributor) commented Dec 19, 2019

Ah, if you are close to having some numbers that would be great indeed!

markharwood (Contributor, Author)

I've not got 100% compatibility with keyword+wildcard matching behaviour, but it's close enough for benchmarks, I expect.

jimczi (Contributor) commented Dec 19, 2019

> I believe using positions would also have the same downside as sloppy phrase queries have about missing matches, since they don't (and maybe can't, realistically, because of the combinatorial explosion) explore all matches.

We only need an ordered query, so I am not sure about the combinatorial explosion? Phrase queries with slop allow some reordering, so that's out of scope for this field. An intervals query would work well, I think, but we need to fix the case where there are duplicates in the query first:
apache/lucene-solr#1097
My take on this is that we should compare with a position-based approach, since doc values would only be useful for validation and they add to the total cost, especially in terms of index size.

> Possible issue with position-based matching when we have array fields (at least in terms of behaviour compared to wildcard search on keyword fields).

That's a good point indeed, and maybe something that we can restrict or explain in the documentation, but imo this shouldn't eliminate this option completely.

> I'm hoping to get some numbers tomorrow on position-based search speeds, so we can see what we are throwing away if we decide not to go down this route.

Speed and index size: that's my main motivation to test this, since doc values seem overkill if we don't want to restrict this to small fields only.

jpountz (Contributor) commented Dec 19, 2019

Ah, you are right, I got confused. The fact that terms within the phrase must be ordered avoids the combinatorial explosion issue that sloppy phrase queries have indeed.

jimczi (Contributor) commented Dec 20, 2019

I got confused too ;) The issue in apache/lucene-solr#1097 does not affect strictly ordered queries, so using Intervals to validate our ngrams with positions should work.
So a query like abcdef*xyz could be rewritten into Intervals.ordered(Intervals.phrase(abc, bcd, cde, def), xyz) and simplified into Intervals.ordered(Intervals.phrase(abc, Intervals.extend(def, 2)), xyz).
I am not sure if it's a problem or not, but we'd need to handle wildcard intervals too, since a query like a*b contains blocks that are smaller than the ngram size. We limit the expansion in these queries to the maximum boolean clause count, so we might not be able to find a match for these queries using positions only. However, we're leaning towards smaller ngrams (3), so this should only be a problem for blocks that have a single character before (or in between) a wildcard.
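A sketch of that rewrite with Lucene's Intervals API (Lucene 8.x; the extend arguments reflect my reading of the simplification, not exact code from this thread):

```java
// abcdef*xyz with 3-grams: anchor on "abc", require "def" to start exactly
// 3 positions later (extend its interval 2 positions to the left), then
// allow an arbitrary gap before "xyz".
IntervalsSource source = Intervals.ordered(
        Intervals.phrase(
                Intervals.term("abc"),
                Intervals.extend(Intervals.term("def"), 2, 0)),
        Intervals.term("xyz"));

Query query = new IntervalQuery("my_field", source);
```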

markharwood (Contributor, Author) commented Dec 20, 2019

Index size comparisons

All are force-merged copies of the same elasticsearch.log strings.

  • wildcardtest_kw: plain keyword
  • wildcardtest_pos: 3-gram, indexed positions, no DV
  • wildcardtest_dv: 3-gram, no positions, DV
  • wildcardtest_txt: default text field (does not support EQL-style wildcard searches, so used for size comparison only)

| index | docs.count | store.size |
|---|---|---|
| wildcardtest_kw | 1768155 | 356.8mb |
| wildcardtest_pos | 1768155 | 963mb |
| wildcardtest_dv | 1768155 | 387.9mb |
| wildcardtest_txt | 1768155 | 182.4mb |

jimczi (Contributor) commented Mar 13, 2020

> Is the issue that it should be formatted as 1 line instead of 2? I have that change in the latest code.

No, that's about not accepting similarity and index_options in the builder: here and here.

jimczi (Contributor) left a comment:

LGTM, thanks @markharwood for all the iterations on this!

markharwood merged commit a2a4756 into elastic:master on Mar 16, 2020
markharwood added a commit to markharwood/elasticsearch that referenced this pull request Mar 16, 2020
Indexes values using size 3 ngrams and also stores the full original as a binary doc value.
Wildcard queries operate by using a cheap approximation query on the ngram field followed up by a more expensive verification query using an automaton on the binary doc values.  Also supports aggregations and sorting.
markharwood added a commit that referenced this pull request Mar 16, 2020
* New wildcard field optimised for wildcard queries (#49993)

Indexes values using size 3 ngrams and also stores the full original as a binary doc value.
Wildcard queries operate by using a cheap approximation query on the ngram field followed up by a more expensive verification query using an automaton on the binary doc values.  Also supports aggregations and sorting.
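As a rough illustration of the verification phase the commit message describes (assuming Lucene 8.x APIs; the pattern and field name are placeholders):

```java
// Compile the wildcard pattern into a byte-level run automaton once
Automaton automaton = WildcardQuery.toAutomaton(new Term("my_wildcard_field", "*queue is full*"));
ByteRunAutomaton verifier = new ByteRunAutomaton(automaton);

// For each candidate surfaced by the cheap ngram approximation query,
// verify against the full original value kept in binary doc values
BinaryDocValues dv = DocValues.getBinary(leafReader, "my_wildcard_field");
if (dv.advanceExact(candidateDocId)) {
    BytesRef value = dv.binaryValue();
    boolean verified = verifier.run(value.bytes, value.offset, value.length);
}
```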
jimczi added the v7.9.0 label on Jun 15, 2020
ebeahan mentioned this pull request on Jul 20, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jul 29, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jul 31, 2020
github-actions bot pushed a commit to elastic/elasticsearch-net that referenced this pull request Jul 31, 2020
github-actions bot pushed a commit to elastic/elasticsearch-net that referenced this pull request Jul 31, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jul 31, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jul 31, 2020
astefan mentioned this pull request on Sep 9, 2020
Labels: >feature, release highlight, :Search Foundations/Mapping, v7.9.0, v8.0.0-alpha1

Successfully merging this pull request may close these issues: Faster wildcard search (#48852)