Respect the ignore_above option in field retrieval API. #57307

jtibshirani · 2020-05-28T18:49:02Z

For keyword-style fields, if the source value is larger than ignore_above
then we don't retrieve the field. In particular, the field is treated as if the
value didn't exist.
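
A minimal sketch of the behaviour described above, not the actual Elasticsearch implementation. The function name `fetch_keyword_values` and the dict standing in for `_source` are hypothetical, and measuring length with Python's `len` is a simplification.

```python
from typing import Optional


def fetch_keyword_values(source: dict, field: str, ignore_above: Optional[int] = None) -> list:
    """Return the values of `field` from `source`, skipping any longer than `ignore_above`."""
    raw = source.get(field)
    if raw is None:
        return []
    # A field may hold a single value or a list of values.
    values = raw if isinstance(raw, list) else [raw]
    kept = []
    for value in values:
        text = str(value)
        # Over-length values are treated as if they were never present.
        if ignore_above is not None and len(text) > ignore_above:
            continue
        kept.append(text)
    return kept
```

With `ignore_above=3`, a source of `{"tag": ["ab", "abcdef"]}` would yield only `["ab"]`.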

elasticmachine · 2020-05-28T18:49:04Z

Pinging @elastic/es-search (:Search/Search)

nik9000

LGTM! I'm kind of surprised ignore_above is done in terms of Java string length. But it is!
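
To illustrate why "Java string length" is a notable choice: Java's `String.length()` counts UTF-16 code units, which can differ from both the number of code points and the number of UTF-8 bytes. The sketch below emulates that count in Python; the helper name `java_string_length` is hypothetical and exists only for this comparison.

```python
def java_string_length(s: str) -> int:
    """Emulate Java's String.length(): the number of UTF-16 code units in s."""
    return len(s.encode("utf-16-le")) // 2


emoji = "\U0001F600"   # one code point outside the Basic Multilingual Plane
accented = "\u00e9"    # one code point inside the Basic Multilingual Plane

# Code points (Python len), UTF-16 code units (Java length), UTF-8 bytes.
emoji_lengths = (len(emoji), java_string_length(emoji), len(emoji.encode("utf-8")))
accented_lengths = (len(accented), java_string_length(accented), len(accented.encode("utf-8")))
```

So a limit expressed in Java string length counts a supplementary character such as an emoji as two units, even though it is a single code point.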

jtibshirani · 2020-05-28T19:44:30Z

Thanks @nik9000! I'm going to wait to merge until I have a chance to catch up with the search team. I want to discuss what ignore_above means more broadly in the search API to make sure we're on the same page -- it's no longer just 'we won't index this value'.

jtibshirani · 2020-05-28T20:09:26Z

@elasticmachine run elasticsearch-ci/default-distro

astefan

LGTM

jimczi · 2020-05-29T15:09:45Z

I am not sure why we want to respect ignore_above, since the value is explicitly retrieved from _source. What's the intent of this change? Is it to make sure that we don't return a value if it was not indexed? That seems out of scope for this API, imo.

jtibshirani · 2020-05-29T16:45:29Z

@jimczi thanks for looking! I agree that this is a departure from what ignore_above usually means -- in the docs we say "Strings longer than the ignore_above setting will not be indexed or stored." I think there is a general question about what options like ignore_above should mean in the search API, and I plan to discuss the broader issue with the team.

Here's more context on this change: the 'fields retrieval' API doesn't claim that it always loads from source. It's a higher-level API, and I think it would be fine if in the future we decided to load from docvalues for some cases. Generally users don't need to care where exactly the data is coming from.

Highlighting is also a high-level fetch API where data can be pulled from multiple places. In that case, we decided to respect the ignore_above option when the data is loaded from source (#43800). So to me ignore_above has started to mean "this field value is too big to work with successfully, and should be ignored when executing a search". I think this is easier for users to reason about than "ignore_above prevents the value from being written as an indexed or doc values field", since users shouldn't need to think so much about the data formats.

@astefan I know that SQL decided to respect ignore_above when returning values. Do you have any ideas to add?

nik9000 · 2020-05-29T19:22:19Z

I think of this API as sort of emulating doc_values so I support respecting ignore_above. But I'm certainly willing to be convinced otherwise.

astefan · 2020-05-30T04:40:57Z

I look at this API as a high-level one where the user doesn't want to be concerned with where Elasticsearch takes a field's values from (_source/doc_values).
Imo, there are two use cases here: return something from _source but search on other fields, or return from doc_values and search on whatever fields. My gut feeling is that the first scenario is less common than the second?

In the case of ignore_above, if the field value was not indexed because of it, the only remaining use for the field is plain value retrieval, with no search or aggregation done on it.

In the particular case of SQL/EQL, we'd want to get a null for this field, because a search on it will treat it as null. Retrieving a value that contradicts the query (i.e. SELECT keyword_value FROM test WHERE keyword_value IS NULL returning a non-null value) breaks the contract.
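
The contract problem can be shown with a toy model. This is only an illustration under stated assumptions: the functions `indexed_value` and `select_where_is_null` are hypothetical, documents are plain dicts, and length is measured with Python's `len`.

```python
def indexed_value(source_value, ignore_above):
    """What a query sees: None if the value is missing or was skipped at index time."""
    if source_value is None or len(source_value) > ignore_above:
        return None
    return source_value


def select_where_is_null(docs, field, ignore_above, respect_ignore_above):
    """Toy version of: SELECT field FROM docs WHERE field IS NULL."""
    rows = []
    for doc in docs:
        # The WHERE clause is evaluated against the indexed view of the field.
        if indexed_value(doc.get(field), ignore_above) is None:
            value = doc.get(field)
            if respect_ignore_above:
                # Retrieval uses the same view as the query.
                value = indexed_value(value, ignore_above)
            rows.append(value)
    return rows


docs = [{"keyword_value": "abcdefgh"}, {"keyword_value": "ab"}]
inconsistent = select_where_is_null(docs, "keyword_value", 5, respect_ignore_above=False)
consistent = select_where_is_null(docs, "keyword_value", 5, respect_ignore_above=True)
```

Here `inconsistent` contains the non-null string `"abcdefgh"` for a row that matched `IS NULL`, while `consistent` contains `None` for that row.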

jtibshirani · 2020-06-02T17:03:16Z

Thanks everyone for weighing in. I've marked this as 'team-discuss' to make it clear we are planning to discuss it as a group.

jpountz · 2020-06-03T17:37:56Z

Some more thoughts: if you want to completely ignore values that are too long, the right way to do this is to configure an ingest pipeline that removes values if they are above a given length. My gut feeling is that ignore_above is only really useful in the context of multi-fields, e.g. if you have a keyword variant that ignores long values and a text variant that doesn't, as you can't remove the field from the _source in that case. So if we found ways to make multi-fields less useful (related to discussions we're having about the wildcard field), maybe we would no longer need ignore_above.
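
For readers unfamiliar with the multi-field case, here is a sketch of such a mapping written as a Python dict, plus a toy helper showing which variants would index a given value. The field name `message`, the limit of 256, and the helper `variants_that_index` are illustrative assumptions, not taken from this PR.

```python
# A text field with a keyword sub-field that skips long values.
mapping = {
    "mappings": {
        "properties": {
            "message": {
                "type": "text",
                "fields": {
                    "keyword": {"type": "keyword", "ignore_above": 256}
                },
            }
        }
    }
}


def variants_that_index(field_name, field_mapping, value):
    """List the field variants that would index `value` (toy model, length via len)."""
    names = [field_name]  # the top-level text field has no length limit in this mapping
    for sub_name, sub_mapping in field_mapping.get("fields", {}).items():
        limit = sub_mapping.get("ignore_above")
        if limit is None or len(value) <= limit:
            names.append(f"{field_name}.{sub_name}")
    return names


message_mapping = mapping["mappings"]["properties"]["message"]
```

A 300-character value would be indexed by `message` but not by `message.keyword`, and it cannot be dropped from `_source` without also losing it for the text variant.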

We're discussing ignore_above here but I think that the decision we make should be the same for ignore_above and ignore_malformed, which are just different ways of ignoring a value.

To me the question raised in this issue boils down to whether we should make the fields API consistent with queries and aggregations, which would ignore values whose length is above ignore_above, or with _source, which would retain all values. I feel like we already made the decision to be consistent with queries and aggregations by resolving actual field values for targets of copy_to and multi-fields, so I'm leaning towards not returning values whose length is above ignore_above with this new API. Not a strong feeling though, I could be convinced otherwise.

jtibshirani · 2020-06-10T21:36:02Z

We discussed as a group, and decided that we want to support ignore_malformed, ignore_above, and null_value. The reasoning is based on what @astefan and @jpountz mentioned above: we want the API to be consistent with queries and aggregations.

I added a summary of the discussion here: #55363 (comment).
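
As a rough illustration of what supporting the three options together could look like, here is a toy value fetcher. It is a sketch under assumptions, not the implementation in the field-retrieval branch: the function `fetch_values` and its parameters are hypothetical, and raising on a malformed value when `ignore_malformed` is off is only a stand-in for the document being rejected at index time.

```python
def fetch_values(source_values, *, parse=str, ignore_above=None,
                 ignore_malformed=False, null_value=None):
    """Return source values as queries and aggregations would see them (toy model)."""
    fetched = []
    for raw in source_values:
        if raw is None:
            # null_value: an explicit null is replaced by the configured substitute.
            if null_value is None:
                continue
            raw = null_value
        try:
            value = parse(raw)
        except (TypeError, ValueError):
            # ignore_malformed: drop values that cannot be parsed for this field type.
            if ignore_malformed:
                continue
            raise
        # ignore_above: drop strings longer than the limit.
        if ignore_above is not None and isinstance(value, str) and len(value) > ignore_above:
            continue
        fetched.append(value)
    return fetched
```

For example, `fetch_values([1, "x", "3"], parse=int, ignore_malformed=True)` keeps the two parseable values and drops `"x"`.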

jtibshirani · 2020-06-10T22:19:03Z

Since I had merged another PR to field-retrieval that conflicted with this one, I rebased and force-pushed. There weren't any big changes.

jtibshirani · 2020-06-10T22:54:19Z

@elasticmachine run elasticsearch-ci/1

jtibshirani added the >enhancement and :Search/Search labels May 28, 2020

elasticmachine added the Team:Search label May 28, 2020

jtibshirani requested review from astefan and nik9000 May 28, 2020 18:53

jtibshirani force-pushed the field-retrieval branch from 56d37f3 to 5a1ffd4 Compare May 28, 2020 19:33

jtibshirani force-pushed the ignore-above branch from cd3ae11 to d736737 Compare May 28, 2020 19:34

nik9000 approved these changes May 28, 2020

jtibshirani mentioned this pull request May 28, 2020

Search 'fields' option design + implementation #55363

Closed

10 tasks

astefan approved these changes May 29, 2020

jtibshirani added the team-discuss label Jun 2, 2020

jtibshirani force-pushed the field-retrieval branch from 5a1ffd4 to f703951 Compare June 8, 2020 21:49

jtibshirani force-pushed the ignore-above branch from d736737 to bf04cbc Compare June 10, 2020 22:15

Respect the ignore_above option.

e745bfa

For keyword-style fields, if the source value is larger than ignore_above then we don't retrieve the field.

jtibshirani force-pushed the ignore-above branch from bf04cbc to e745bfa Compare June 10, 2020 22:17

jtibshirani merged commit 8131b5e into elastic:field-retrieval Jun 10, 2020

jtibshirani deleted the ignore-above branch June 10, 2020 23:31

jtibshirani removed the team-discuss label Jun 11, 2020
