Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exists query #3710

Merged
merged 1 commit into from
Aug 7, 2023
Merged

Exists query #3710

merged 1 commit into from
Aug 7, 2023

Conversation

fulmicoton
Copy link
Contributor

@fulmicoton fulmicoton commented Aug 4, 2023

Field exists does not consider types, only field names.
Field capability will have to be handled differently unfortunately.

This works by introducing an internal (but normal) "u64" field
that stores postings list for field existence.

For performance/RAM reasons, the fields full path is not stored
as a string but instead we compute a u64-fnv hash using the
path from root to leaf.

If the hash perfects ideally, even with the anniversary attach, collisions
are very unlikely.

When dealing with complex JSON with the raw tokenizer this feature can
double the number of tokens we deal with, and has an impact on
performance.

For this reason, it is not added as an option in the DocMapper.

Like Elasticsearch, we only store field existence of indexed fields.
Also in order to handle refinement like expand_dots,
we work over the built tantivy Document and reuse the existing
resolution logic.

On 1.4GB of gharchive (which is close to a worst case scenaio),
see the following performance/index size change:

With field_exists enabled

  • Indexing Throughput: 41 MB/s
  • Index size: 701M

With field_exists disabled

  • Indexing Throughput: 46 MB/s
  • Index size: 698M

@fulmicoton fulmicoton force-pushed the exists-query-exploring-hashing branch 5 times, most recently from 6cc9612 to 04cc9d3 Compare August 4, 2023 16:20
@fulmicoton fulmicoton changed the title Exists query exploring hashing Exists query Aug 4, 2023
@fulmicoton fulmicoton marked this pull request as ready for review August 4, 2023 16:21
@fulmicoton fulmicoton requested a review from imotov August 4, 2023 16:21
Copy link
Collaborator

@imotov imotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting approach. Considering that we are dealing with a small number of fields the probability of collisions is really tiny here. So, that should probably work. The failure will be confusing as hell though :)

The implementation looks good to me, but I was a bit confused by naming. Having "field_exists": true, on the mapping level next to field_mappings through me off for a bit, even though I knew what this PR is about. Based on the code it also looks like we are trying to mimic the behavior of elasticsearch which checks for existence of the field value not existence of a field. But unlike elasticsearch we are adding a value catalog field instead of checking the existing index. So it is not free and therefore we need this switch.

I think this parameter is somewhat similar to the store_source parameter. So to be consistent we can replace field_exists with index_value_presence, store_value_catalog, add_value_presence_field or something like this.

LGTM assuming field_exists is renamed into something that starts with verb and has a word value in it. It would be also good to add some documentation.

output: &mut FnvHashSet<u64>,
) {
match json_value {
JsonValue::Null => {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it's not about field presence, but rather about value presence, which is probably the right thing to do, we just need to adjust naming a bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the name is really garbage. It was my queen's duck. https://bwiggs.com/notebook/queens-duck/

More seriously I am unhappy about all names.
index_value_presence might be my favorite in your list.
Francois also offered index_field_presence.

I am not sure the semantic battle between field vs value confusion is that important here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally dealt with several users trying to use exist query to differentiate between documents where field is not present and where field is present but contains null and complaining that this query doesn't work even though the docs explicitly said "documents where a specific field has a value". So, I would go a bit father and also rename the query as well into has_value or not_null or something like this.

@@ -40,10 +44,11 @@ use crate::query_builder::build_query;
use crate::routing_expression::RoutingExpr;
use crate::{
Cardinality, DocMapper, DocParsingError, Mode, QueryParserError, TokenizerEntry, WarmupInfo,
DYNAMIC_FIELD_NAME, SOURCE_FIELD_NAME,
DYNAMIC_FIELD_NAME, FIELD_PRESENCE_FIELD_NAME, SOURCE_FIELD_NAME,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VALUE_PRESENCE_FIELD_NAME?

@@ -378,6 +391,58 @@ fn extract_single_obj(
}
}

#[inline]
fn populate_field_present_val(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pupulate_value_present_field_val?

@fulmicoton fulmicoton force-pushed the exists-query-exploring-hashing branch 2 times, most recently from 4bd81f1 to 8a59e05 Compare August 7, 2023 06:12
Field exists does not consider types, only field names.
Field capability will have to be handled differently unfortunately.

This works by introducing an internal (but normal) "u64" field
that stores postings list for field existence.

For performance/RAM reasons, the fields full path is not stored
as a string but instead we compute a u64-fnv hash using the
path from root to leaf.

If the hash perfects ideally, even with the anniversary attach, collisions
are very unlikely.

When dealing with complex JSON with the raw tokenizer this feature can
double the number of tokens we deal with, and has an impact on
performance.

For this reason, it is not added as an option in the DocMapper.

Like Elasticsearch, we only store field existence of indexed fields.
Also in order to handle refinement like expand_dots,
we work over the built tantivy Document and reuse the existing
resolution logic.

On 1.4GB of gharchive (which is close to a worst case scenaio),
see the following performance/index size change:

With field_exists enabled
- Indexing Throughput: 41 MB/s
- Index size: 701M

With field_exists disabled
- Indexing Throughput: 46 MB/s
- Index size: 698M
@fulmicoton fulmicoton force-pushed the exists-query-exploring-hashing branch from 8a59e05 to e4d2c62 Compare August 7, 2023 06:38
@fulmicoton fulmicoton enabled auto-merge (squash) August 7, 2023 06:38
@fulmicoton fulmicoton merged commit afc81b0 into main Aug 7, 2023
3 checks passed
@fulmicoton fulmicoton deleted the exists-query-exploring-hashing branch August 7, 2023 06:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants