Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search 'fields' option design + implementation #55363

Closed
10 tasks done
jtibshirani opened this issue Apr 16, 2020 · 12 comments
Closed
10 tasks done

Search 'fields' option design + implementation #55363

jtibshirani opened this issue Apr 16, 2020 · 12 comments
Assignees
Labels
>feature Meta :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@jtibshirani
Copy link
Contributor

jtibshirani commented Apr 16, 2020

Original issue: #49028
Feature branch: field-retrieval
Docs: https://www.elastic.co/guide/en/elasticsearch/reference/7.x/search-fields.html

Motivation

Often a user wants to retrieve a particular set of fields during a search. Currently, we don't support this usage pattern in a good way. In short, given a list of fields, there is no easy way to load all of their values:

  • We can’t load all of them from doc values. Some fields like text fields may not have doc values at all, or we may exceed the limit for a reasonable number of doc value fields to load.
  • It’s not easy to load all of them through source. For example, if the field is a field alias, it’s difficult to determine where to find its value in the source.

Better field retrieval support is becoming even more important now that we're introducing more field types that don’t fit the typical pattern like constant_keyword and the proposed runtime fields (#48063).

Feature Summary

We plan to add a new fields section to the search request, which users would specify instead of using source filtering to load fields from source:

POST logs-*/_search
{
  "query": { "match_all": {} },
  "fields": [
    "file.*",
    {
      "field": "event.timestamp",
      "format": "epoch_millis"
    },
    ...
  ]
}

Both full field names and wildcard patterns are accepted. Only leaf fields are returned, the API will not allow for fetching object values. The fields are returned as a flat list in the fields section in each hit, the same as we do for docvalue_fields and script_fields.

Overall, the API gives a friendly way to load fields from source:

  • If a non-standard field like a field alias, multi-field, or constant_keyword is specified in fields, then we’ll consult the mappings to find and return the right value.
  • The fields are returned in a flat list, as opposed to structured JSON.
  • For date and numeric field types, we would support the same format parameter as we do for docvalue_fields to allow for adjusting the format of the results.
  • Each value would be returned in a 'canonical' format -- for example if a field is mapped as an integer, it will be returned as an integer even if it was specified as a string in the _source.

Some clarifications:

  • In this first pass, the API will not attempt to load from stored fields or doc values.
  • For simplicity of parsing, values will always be returned in an array, even if there is only one value present.

Implementation Plan

Future improvements:

Open Questions

  • If a wildcard pattern matches both a parent field and one of its multi-fields, should we just return the parent to avoid returning the same value twice? A similar question holds for field aliases and their target fields.
  • Should the API return fields in _source that have been disabled in the mappings (enabled: false)?
  • For keyword fields, should we apply the normalizer or return the original value?
@jtibshirani jtibshirani added >feature :Search/Search Search-related issues that do not fall into other categories Meta labels Apr 16, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Search)

@jtibshirani jtibshirani self-assigned this Apr 16, 2020
@jtibshirani jtibshirani changed the title High-level field retrieval API design + implementation Field retrieval API design + implementation Apr 28, 2020
jtibshirani added a commit that referenced this issue Apr 29, 2020
This commit adds the capability to `FieldTypeLookup` to retrieve a field's
paths in the _source. When retrieving a field's values, we consult these
source paths to make sure we load the relevant values. This allows us to handle
requests for field aliases and multi-fields.

We also retrieve values that were copied into the field through copy_to. To me
this is what users would expect out of the API, and it's consistent with what
comes back from `docvalues_fields` and `stored_fields`. However it does add
some complexity, and was not something flagged as important from any of the
clients I spoke to about this API. I'm looking for feedback on this point.

Relates to #55363.
@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
jtibshirani added a commit that referenced this issue May 5, 2020
This commit adds the capability to `FieldTypeLookup` to retrieve a field's
paths in the _source. When retrieving a field's values, we consult these
source paths to make sure we load the relevant values. This allows us to handle
requests for field aliases and multi-fields.

We also retrieve values that were copied into the field through copy_to. To me
this is what users would expect out of the API, and it's consistent with what
comes back from `docvalues_fields` and `stored_fields`. However it does add
some complexity, and was not something flagged as important from any of the
clients I spoke to about this API. I'm looking for feedback on this point.

Relates to #55363.
jtibshirani added a commit that referenced this issue May 7, 2020
This commit adds the capability to `FieldTypeLookup` to retrieve a field's
paths in the _source. When retrieving a field's values, we consult these
source paths to make sure we load the relevant values. This allows us to handle
requests for field aliases and multi-fields.

We also retrieve values that were copied into the field through copy_to. To me
this is what users would expect out of the API, and it's consistent with what
comes back from `docvalues_fields` and `stored_fields`. However it does add
some complexity, and was not something flagged as important from any of the
clients I spoke to about this API. I'm looking for feedback on this point.

Relates to #55363.
jtibshirani added a commit that referenced this issue May 18, 2020
This commit adds the capability to `FieldTypeLookup` to retrieve a field's
paths in the _source. When retrieving a field's values, we consult these
source paths to make sure we load the relevant values. This allows us to handle
requests for field aliases and multi-fields.

We also retrieve values that were copied into the field through copy_to. To me
this is what users would expect out of the API, and it's consistent with what
comes back from `docvalues_fields` and `stored_fields`. However it does add
some complexity, and was not something flagged as important from any of the
clients I spoke to about this API. I'm looking for feedback on this point.

Relates to #55363.
jtibshirani added a commit that referenced this issue May 20, 2020
This commit adds the capability to `FieldTypeLookup` to retrieve a field's
paths in the _source. When retrieving a field's values, we consult these
source paths to make sure we load the relevant values. This allows us to handle
requests for field aliases and multi-fields.

We also retrieve values that were copied into the field through copy_to. To me
this is what users would expect out of the API, and it's consistent with what
comes back from `docvalues_fields` and `stored_fields`. However it does add
some complexity, and was not something flagged as important from any of the
clients I spoke to about this API. I'm looking for feedback on this point.

Relates to #55363.
jtibshirani added a commit that referenced this issue May 28, 2020
This commit adds the capability to `FieldTypeLookup` to retrieve a field's
paths in the _source. When retrieving a field's values, we consult these
source paths to make sure we load the relevant values. This allows us to handle
requests for field aliases and multi-fields.

We also retrieve values that were copied into the field through copy_to. To me
this is what users would expect out of the API, and it's consistent with what
comes back from `docvalues_fields` and `stored_fields`. However it does add
some complexity, and was not something flagged as important from any of the
clients I spoke to about this API. I'm looking for feedback on this point.

Relates to #55363.
jtibshirani added a commit that referenced this issue May 28, 2020
This commit adds the capability to `FieldTypeLookup` to retrieve a field's
paths in the _source. When retrieving a field's values, we consult these
source paths to make sure we load the relevant values. This allows us to handle
requests for field aliases and multi-fields.

We also retrieve values that were copied into the field through copy_to. To me
this is what users would expect out of the API, and it's consistent with what
comes back from `docvalues_fields` and `stored_fields`. However it does add
some complexity, and was not something flagged as important from any of the
clients I spoke to about this API. I'm looking for feedback on this point.

Relates to #55363.
jtibshirani added a commit to jtibshirani/elasticsearch that referenced this issue Jun 2, 2020
This PR replaces the marker interface with the method
FieldMapper#parsesArrayValue. I find this cleaner and it will help with the
fields retrieval work (elastic#55363).

The refactor also ensures that only field mappers can declare they parse array
values. Previously other types like ObjectMapper could implement the marker
interface and be passed array values, which doesn't make sense.
jtibshirani added a commit that referenced this issue Jun 3, 2020
This PR replaces the marker interface with the method
FieldMapper#parsesArrayValue. I find this cleaner and it will help with the
fields retrieval work (#55363).

The refactor also ensures that only field mappers can declare they parse array
values. Previously other types like ObjectMapper could implement the marker
interface and be passed array values, which doesn't make sense.
jtibshirani added a commit to jtibshirani/elasticsearch that referenced this issue Jun 3, 2020
This PR replaces the marker interface with the method
FieldMapper#parsesArrayValue. I find this cleaner and it will help with the
fields retrieval work (elastic#55363).

The refactor also ensures that only field mappers can declare they parse array
values. Previously other types like ObjectMapper could implement the marker
interface and be passed array values, which doesn't make sense.
jtibshirani added a commit to jtibshirani/elasticsearch that referenced this issue Jun 3, 2020
This PR replaces the marker interface with the method
FieldMapper#parsesArrayValue. I find this cleaner and it will help with the
fields retrieval work (elastic#55363).

The refactor also ensures that only field mappers can declare they parse array
values. Previously other types like ObjectMapper could implement the marker
interface and be passed array values, which doesn't make sense.
jtibshirani added a commit that referenced this issue Jun 3, 2020
This PR replaces the marker interface with the method
FieldMapper#parsesArrayValue. I find this cleaner and it will help with the
fields retrieval work (#55363).

The refactor also ensures that only field mappers can declare they parse array
values. Previously other types like ObjectMapper could implement the marker
interface and be passed array values, which doesn't make sense.
jtibshirani added a commit that referenced this issue Jun 3, 2020
This PR replaces the marker interface with the method
FieldMapper#parsesArrayValue. I find this cleaner and it will help with the
fields retrieval work (#55363).

The refactor also ensures that only field mappers can declare they parse array
values. Previously other types like ObjectMapper could implement the marker
interface and be passed array values, which doesn't make sense.
jtibshirani added a commit that referenced this issue Jun 8, 2020
This commit adds the capability to `FieldTypeLookup` to retrieve a field's
paths in the _source. When retrieving a field's values, we consult these
source paths to make sure we load the relevant values. This allows us to handle
requests for field aliases and multi-fields.

We also retrieve values that were copied into the field through copy_to. To me
this is what users would expect out of the API, and it's consistent with what
comes back from `docvalues_fields` and `stored_fields`. However it does add
some complexity, and was not something flagged as important from any of the
clients I spoke to about this API. I'm looking for feedback on this point.

Relates to #55363.
jtibshirani added a commit that referenced this issue Jun 26, 2020
This commit adds the capability to `FieldTypeLookup` to retrieve a field's
paths in the _source. When retrieving a field's values, we consult these
source paths to make sure we load the relevant values. This allows us to handle
requests for field aliases and multi-fields.

We also retrieve values that were copied into the field through copy_to. To me
this is what users would expect out of the API, and it's consistent with what
comes back from `docvalues_fields` and `stored_fields`. However it does add
some complexity, and was not something flagged as important from any of the
clients I spoke to about this API. I'm looking for feedback on this point.

Relates to #55363.
@jtibshirani
Copy link
Contributor Author

jtibshirani commented Jun 30, 2020

I thought more about the question of whether we should apply normalizer before returning a keyword, and to me it makes sense to apply normalization by default:

  • The actual name normalizer suggests that it is performing standardization, which is exactly the sort of thing we claim to do when returning values. I was hesitant earlier because normalization felt more like text analysis to me, but I think it's typically value standardization (and we just happen to re-use the analyzer framework).
  • It's most consistent with other parts of the search response, in particular terms aggregations.
  • If the original value is actually interesting in terms of casing, accents, etc., then a user will probably avoid normalizing, and instead perform a case-insensitive search (coming soon :)). Or, they would set up a multi-field where one field contains the original, and the other a normalized form.

Tagging @jpountz @jimczi @javanna @nik9000 in case they have thoughts on the above, happy to discuss here!

jtibshirani added a commit that referenced this issue Jun 30, 2020
This commit adds the capability to `FieldTypeLookup` to retrieve a field's
paths in the _source. When retrieving a field's values, we consult these
source paths to make sure we load the relevant values. This allows us to handle
requests for field aliases and multi-fields.

We also retrieve values that were copied into the field through copy_to. To me
this is what users would expect out of the API, and it's consistent with what
comes back from `docvalues_fields` and `stored_fields`. However it does add
some complexity, and was not something flagged as important from any of the
clients I spoke to about this API. I'm looking for feedback on this point.

Relates to #55363.
@jimczi
Copy link
Contributor

jimczi commented Jun 30, 2020

+1 to support normalizer ootb. As you said it's consistent with aggregations and script. That would also make the docvalue_fields alternative less appealing, which I find important since we want to simplify the reasoning when retrieving fields.

@javanna
Copy link
Member

javanna commented Jul 1, 2020

@jtibshirani when you say "by default" does that mean it can be disabled? From the point of view of "when you retrieve, you get what you sent, when you search and aggregate, you search on and get back what was indexed" I am torn, I would personally expect the raw value loaded from source. Though if users can control what they get, I would not mind that the default is normalized.

@jtibshirani
Copy link
Contributor Author

@javanna I was indeed wondering if we could make the behavior configurable, but don't have immediate plans to do so. It's always nice to avoid configuration options and have strong defaults.

I have also been on the fence about this, I can see arguments both ways. I suggest that we move forward with normalizing the values for now. I'm going to ask the teams planning to use this feature (SQL, ML, Kibana) to try to integrate with it before we ship it, and have a short list of questions I plan to ask them which includes keyword normalization. The questions are tracked in the issue description.

@jtibshirani
Copy link
Contributor Author

jtibshirani commented Jul 16, 2020

We discussed how geo fields should be returned with @talevy and @imotov. A summary of our discussion:

  • Geo data should be returned in a consistent format. We accept a variety of formats during indexing, and feel it would be helpful for clients if all fields were returned in a single format.
  • The default format should be 'geojson', since this matches our usual JSON return format, and it's natural for Kibana. We should also support well-known text, which is best for SQL. Note that geo points (in addition to shapes) will also be returned in geojson.
  • The user will be able to select the format by setting format: wkt or format: geojson.
  • Ideally if the source value is already in the right format to return, we won't re-parse it to a geolib object then re-serialize it.

@jtibshirani
Copy link
Contributor Author

An additional note: now that we'll return points in geojson format by default, for consistency we should accept this format when indexing points. We currently don't allow geojson, the work to add support is tracked in #47815.

jtibshirani added a commit that referenced this issue Jul 27, 2020
…60100)

This feature adds a new `fields` parameter to the search request, which
consults both the document `_source` and the mappings to fetch fields in a
consistent way. The PR merges the `field-retrieval` feature branch.

Addresses #49028 and #55363.
jtibshirani added a commit that referenced this issue Jul 28, 2020
…60258)

This feature adds a new `fields` parameter to the search request, which
consults both the document `_source` and the mappings to fetch fields in a
consistent way. The PR merges the `field-retrieval` feature branch.

Addresses #49028 and #55363.
@jtibshirani
Copy link
Contributor Author

The feature branch was merged in #60100. I'll open new issues/ PRs for the follow-up improvements mentioned in the description.

@jtibshirani jtibshirani changed the title Field retrieval API design + implementation Search 'fields' option design + implementation Aug 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature Meta :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team
Projects
None yet
Development

No branches or pull requests

6 participants