Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Respect the ignore_above option in field retrieval API. #57307

Merged
merged 1 commit into from
Jun 10, 2020

Conversation

jtibshirani
Copy link
Contributor

@jtibshirani jtibshirani commented May 28, 2020

For keyword-style fields, if the source value is larger than ignore_above
then we don't retrieve the field. In particular, the field is treated as if the
value didn't exist.

@jtibshirani jtibshirani added >enhancement :Search/Search Search-related issues that do not fall into other categories labels May 28, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Search)

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I'm kind of surprised ignore_above is done in terms of java string length. But it is!

@jtibshirani
Copy link
Contributor Author

jtibshirani commented May 28, 2020

Thanks @nik9000! I'm going to wait to merge until I have a chance to catch up with the search team. I want to discuss what ignore_above means more broadly in the search API to make sure we're on the same page -- it's no longer just 'we won't index this value'.

@jtibshirani
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/default-distro

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jimczi
Copy link
Contributor

jimczi commented May 29, 2020

I am not sure why we want to respect ignore_above since the value is explicitly retrieved from _source ? What's the intent of this change ? Is it to make sure that we don't return a value if it was not indexed ? That seems out of the scope for this API imo.

@jtibshirani
Copy link
Contributor Author

jtibshirani commented May 29, 2020

@jimczi thanks for looking ! I agree that this is a departure from what ignore_above usually means -- in the docs we say "Strings longer than the ignore_above setting will not be indexed or stored." I think there is a general question about what options like ignore_above should mean in the search API, I plan to discuss the broader issue with the team.

Here's more context on this change: the 'fields retrieval' API doesn't claim that it always loads from source. It's a higher-level API, and I think it would be fine if in the future we decided to load from docvalues for some cases. Generally users don't need to care where exactly the data is coming from.

Highlighting is also a high-level fetch API where data can be pulled from multiple places. In that case, we decided to respect the ignore_above option when the data is loaded from source (#43800). So to me ignore_above has started to mean "this field value is too big to work with successfully, and should be ignored when executing a search". I think this is easier for users to reason about than "ignore_above prevents the value from being written as an indexed or doc values field", since users shouldn't need to think so much about the data formats.

@astefan I know that SQL decided to respect ignore_above when returning values. Do you have any ideas to add?

@nik9000
Copy link
Member

nik9000 commented May 29, 2020

I think of this API as sort of emulating doc_values so I support respecting ignore_above. But I'm certainly willing to be convinced otherwise.

@astefan
Copy link
Contributor

astefan commented May 30, 2020

I look at this API as a high level one where the user doesn't want to be concerned where Elasticsearch is taking the values for a field from (_source/doc_values).
Imo, there are two use cases here: return something from _source, but do the search on other fields, or return from doc_values and do the search on whatever fields. My gut feeling is that the first scenario is less common than the second?

In the case of ignore_above, if the field value was not indexed because of it, the only useful scenario of this field is just value retrieval without any search or aggregation done on it.

In the particular case of SQL/EQL, we'd want to get a null for this field because if we search on it, it will behave like a null value when doing so. Retrieving a value for something that will not match a query (ie SELECT keyword_value FROM test WHERE keyword_value IS NULL returning a non-null value) is a breaking of contract.

@jtibshirani
Copy link
Contributor Author

Thanks everyone for weighing in. I've marked this as 'team-discuss' to make it clear we are planning to discuss it as a group.

@jpountz
Copy link
Contributor

jpountz commented Jun 3, 2020

Some more thoughts: if you want to completely ignore too long values, the right way to do this is to configure an ingest pipeline that removes values if they are above a given length. My gut feeling is that ignore_above is only really useful in the context of multi-fields, e.g. if you have a keyword variant that ignores long values and a text variant that doesn't, as you can't remove the field from the _source in that case. So if we found ways to make multi-fields less useful (related to discussions we're having about the wildcard field), maybe we would no longer need ignore_above.

We're discussing ignore_above here but I think that the decision we make should be the same for ignore_above and ignore_malformed, which are just different ways of ignoring a value.

To me the question raised in this issue boils down to whether we should make the fields API consistent with queries and aggregations, which would ignore values whose length is above ignore_above, or with _source which would retain all values. I feel like we made the decision to be consistent with queries are aggregations already by resolving actual field values for targets of copy_to and multi-fields, so I'm leaning towards not returning values whose length is above ignore_above with this new API. Not a strong feeling though, I could be convinced otherwise.

@jtibshirani
Copy link
Contributor Author

jtibshirani commented Jun 10, 2020

We discussed as a group, and decided that we want to support ignore_malformed, ignore_above, and null_value. The reasoning is based on what @astefan and @jpountz mentioned above: we want the API to be consistent with queries and aggregations.

I added a summary of the discussion here: #55363 (comment).

For keyword-style fields, if the source value is larger than ignore_above then
we don't retrieve the field.
@jtibshirani
Copy link
Contributor Author

Since I had merged another PR to field-retrieval that conflicted with this one, I rebased and force-pushed. There weren't any big changes.

@jtibshirani
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/1

@jtibshirani jtibshirani merged commit 8131b5e into elastic:field-retrieval Jun 10, 2020
@jtibshirani jtibshirani deleted the ignore-above branch June 10, 2020 23:31
jtibshirani added a commit that referenced this pull request Jun 17, 2020
For keyword-style fields, if the source value is larger than `ignore_above`
then we don't retrieve the field. In particular, the field is treated as if the
value didn't exist.
jtibshirani added a commit that referenced this pull request Jun 26, 2020
For keyword-style fields, if the source value is larger than `ignore_above`
then we don't retrieve the field. In particular, the field is treated as if the
value didn't exist.
jtibshirani added a commit that referenced this pull request Jun 30, 2020
For keyword-style fields, if the source value is larger than `ignore_above`
then we don't retrieve the field. In particular, the field is treated as if the
value didn't exist.
jtibshirani added a commit that referenced this pull request Jun 30, 2020
For keyword-style fields, if the source value is larger than `ignore_above`
then we don't retrieve the field. In particular, the field is treated as if the
value didn't exist.
jtibshirani added a commit that referenced this pull request Jul 8, 2020
For keyword-style fields, if the source value is larger than `ignore_above`
then we don't retrieve the field. In particular, the field is treated as if the
value didn't exist.
jtibshirani added a commit that referenced this pull request Jul 14, 2020
For keyword-style fields, if the source value is larger than `ignore_above`
then we don't retrieve the field. In particular, the field is treated as if the
value didn't exist.
jtibshirani added a commit that referenced this pull request Jul 15, 2020
For keyword-style fields, if the source value is larger than `ignore_above`
then we don't retrieve the field. In particular, the field is treated as if the
value didn't exist.
jtibshirani added a commit that referenced this pull request Jul 18, 2020
For keyword-style fields, if the source value is larger than `ignore_above`
then we don't retrieve the field. In particular, the field is treated as if the
value didn't exist.
jtibshirani added a commit that referenced this pull request Jul 21, 2020
For keyword-style fields, if the source value is larger than `ignore_above`
then we don't retrieve the field. In particular, the field is treated as if the
value didn't exist.
jtibshirani added a commit that referenced this pull request Jul 23, 2020
For keyword-style fields, if the source value is larger than `ignore_above`
then we don't retrieve the field. In particular, the field is treated as if the
value didn't exist.
jtibshirani added a commit to jtibshirani/elasticsearch that referenced this pull request Jul 27, 2020
For keyword-style fields, if the source value is larger than `ignore_above`
then we don't retrieve the field. In particular, the field is treated as if the
value didn't exist.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants