A high level way of retrieving values for certain fields #49028

Closed
dimitris-athanasiou opened this issue Nov 13, 2019 · 10 comments
Assignees
Labels
>feature :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@dimitris-athanasiou
Contributor

Describe the feature:

More and more use cases treat Elasticsearch as a data store, yet retrieving fields today is complex and requires expertise in several areas. One needs to understand mappings, doc_values, and stored fields. There are further complexities, such as discovering the max doc_value field limit and then working around it by detecting that a user requested more fields than the limit allows and fetching them from _source instead.
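The limit workaround mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code used in ML: `build_retrieval_body` is a hypothetical helper, and the default of 100 is taken from the `index.max_docvalue_fields_search` error message quoted further down in this thread.

```python
from typing import Any, Dict, List

# Assumed default of index.max_docvalue_fields_search (see the error text in this thread).
DEFAULT_MAX_DOCVALUE_FIELDS = 100


def build_retrieval_body(
    fields: List[str],
    max_docvalue_fields: int = DEFAULT_MAX_DOCVALUE_FIELDS,
) -> Dict[str, Any]:
    """Return a search body fragment that fetches `fields`.

    Uses docvalue_fields while the request fits under the limit and
    falls back to _source filtering once it does not.
    """
    if len(fields) <= max_docvalue_fields:
        return {
            "_source": False,
            "docvalue_fields": [{"field": name} for name in fields],
        }
    # Too many fields for doc values: ask for them from _source instead.
    return {"_source": fields}
```

Even this small sketch shows the problem: the caller has to know the limit exists, and the two branches return values in different shapes.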

Then, of course, there are multi-fields. Which variant should I pick? How do I even detect that a field has multi-fields in order to avoid retrieving the same field multiple times? There is an answer to this, of course (check whether there is a parent field that is not an object), but it hopefully illustrates how complex this is.
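As a rough sketch of that parent check, assuming the caller already has a flat map of dotted field names to mapping types (for example, derived from the mappings or from field_caps). The `find_multi_fields` helper is hypothetical, and treating `nested` as a container alongside `object` is my assumption:

```python
from typing import Dict, Set

# Types that hold sub-properties rather than multi-fields (assumption: object and nested).
CONTAINER_TYPES = {"object", "nested"}


def find_multi_fields(field_types: Dict[str, str]) -> Set[str]:
    """Return the names whose parent path is itself a non-container field."""
    multi_fields: Set[str] = set()
    for name in field_types:
        parent, dot, _leaf = name.rpartition(".")
        if not dot:
            continue  # top-level field, cannot be a multi-field
        parent_type = field_types.get(parent)
        if parent_type is not None and parent_type not in CONTAINER_TYPES:
            multi_fields.add(name)
    return multi_fields
```

For a map containing `url` (text) and `url.keyword` (keyword), only `url.keyword` is reported, so a caller could skip it to avoid fetching the same value twice.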

Having written code to do this for ML, I have multiple stories about the complexities that arise. I think other users must have gone through a similar process.

I propose a new API that simply retrieves values given a list of fields. The API does not intend to do this in the most performant way. Rather, it intends to do it in the most user-friendly way. It targets users who do not know the inner workings of Elasticsearch and who have not yet hit a performance issue that would start them on an optimization journey (see "is it faster to retrieve from _source or doc_values" types of questions).

@dimitris-athanasiou dimitris-athanasiou added >feature :Search/Search Search-related issues that do not fall into other categories labels Nov 13, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@jimczi
Contributor

jimczi commented Nov 18, 2019

We discussed this issue in our search meeting and we've spotted two enhancements that could help to retrieve values more easily:

  • The field_caps API should expose the source path of the field if it's not present in the _source (alias, multi-fields, ...): Add source_path information to field_caps API #49264
  • The format of values when retrieving the _source should be customizable in order to allow a date, for instance, to be returned as a timestamp since epoch rather than a string. This feature would be equivalent to the format option of docvalue_fields, but it would be applied to the original source directly.
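To illustrate what the second enhancement would save callers from doing themselves, here is a small client-side sketch that converts an ISO 8601 date string, as it might appear in `_source`, to epoch milliseconds. The `iso_to_epoch_millis` helper is hypothetical, and treating values without an offset as UTC is an assumption:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_to_epoch_millis(value: str) -> int:
    """Convert an ISO 8601 date string to milliseconds since the epoch."""
    # Older Python versions reject a trailing "Z" in fromisoformat().
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: values without an offset are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)
```

A real implementation would also have to honor the date formats configured in the mapping, which is part of why doing this on the server side is attractive.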

@costin
Member

costin commented Feb 10, 2020

Discussed in the meeting today, adding team-discuss to clarify the remaining scope once @jimczi is back (are we okay with the current plan, or do we need a higher-level API to handle the retrieval?).

@joshdevins
Member

I can imagine this being necessary as well for feature extraction in our planned LTR work, both at training and inference time, to extract document-only features (i.e. features that are not query/context dependent).
/cc @davidkyle @jtibshirani

@wylieconlon

We have run into this problem in Kibana, where we are primarily asking users to interact with dotted field names like system.cpu.user.pct or url.keyword in building their visualizations.
Because the dotted names are what we train users to see, we keep a cache of the dotted names from the field_caps API (the index pattern object), and use this when asking users to build queries or visualizations. Why don't the _search APIs construct dotted paths for us?
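For context, this is roughly the kind of client-side flattening a consumer has to do today to turn a nested `_source` into dotted paths. It is a minimal sketch with a hypothetical `flatten_source` helper, not Kibana's actual code, and wrapping every value in a list is a choice made here for uniformity:

```python
from typing import Any, Dict, List


def flatten_source(source: Dict[str, Any], prefix: str = "") -> Dict[str, List[Any]]:
    """Flatten a nested _source dict into {dotted.path: [values]}."""
    flat: Dict[str, List[Any]] = {}

    def merge(nested: Dict[str, Any], path: str) -> None:
        # Fold the flattened children of `nested` into the result.
        for sub_path, sub_values in flatten_source(nested, f"{path}.").items():
            flat.setdefault(sub_path, []).extend(sub_values)

    for key, value in source.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            merge(value, path)
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    merge(item, path)  # arrays of objects share one dotted path
                else:
                    flat.setdefault(path, []).append(item)
        else:
            flat.setdefault(path, []).append(value)
    return flat
```

Note that this can only surface what is physically in `_source`: multi-fields such as url.keyword and alias fields never appear, which is one reason a server-side option is needed.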

Proposal: Add a new parameter fields to the _search API which implements the high-level retrieval described here, combining the behavior of _source and docvalue_fields. It is important for use in Kibana to support unlimited wildcards. It is important for us to be able to display the entire document using a query like fields: '*' or fields: ['system.cpu.*'].

The kibana sample data contains both text and keyword mappings, and is a good illustration of the response shape that I would expect:

POST kibana_sample_data_logs/_search
{
  "query": { "match_all": {} },
  "_source": "",
  "fields": [{ "field": "*" }],
  "size": 10
}

"fields": {
  "bytes": [ 8679 ],
  "extension": "",
  "extension.keyword" : [ "" ],
  "geo.coordinates" : [ "32.69899999257177, -94.94886112399399" ],
  "geo.src" : [ "CN" ],
  "geo.dest" : [ "IT" ],
  "geo.srcdest" : [ "CN:IT" ],
  "host": "www.elastic.co",
  "host.keyword" : [ "www.elastic.co" ],
  "machine.os" : "win xp",
  "machine.os.keyword" : [ "win xp" ],
  "machine.ram" : 11811160064,
  "response": 200,
  "response.keyword": ["200"],
  "tags": ["success","info"],
  "tags.keyword": ["info", "success", "info", "success"]
}

The example request is easy to write for any user of Elasticsearch, and the response contains information that is from both doc_values and _source. This is a simple, high-level API that we could work with. Unfortunately, this isn't possible by combining any of the APIs that exist today for a few reasons.

Limitations of current APIs

I have been testing with ECS-based schemas like metricbeat, which on my cluster contains 3904 named paths in the mapping. Not all of these fields are actively used, but the sheer size of the mapping causes problems. Here are the limitations I've found:

  1. _source: "*" does not include multi-mapped or alias fields
  2. Making a _source request with a list of 3904 paths like _source: [...] causes the error:
    {
      "type" : "too_complex_to_determinize_exception",
      "reason" : "Determinizing automaton with 235539 states and 239442 transitions would result in more than 10000 states."
    }
    
  3. It's not possible to get all docvalues with a wildcard, even on small indices. The query docvalue_fields: [{ field: "*" }] throws an error if there are any text fields at all:

    Fielddata is disabled on text fields by default. Set fielddata=true on [request] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.

  4. It's not possible to get all docvalues on a large mapping like metricbeat. The request docvalue_fields: [{ field: "*" }] causes the error:

    Trying to retrieve too many docvalue_fields. Must be less than or equal to: [100] but was [2588]. This limit can be set by changing the [index.max_docvalue_fields_search] index level setting.

  5. Listing too many paths in the request for docvalue_fields also causes the same error:

    Trying to retrieve too many docvalue_fields. Must be less than or equal to: [100] but was [3900]. This limit can be set by changing the [index.max_docvalue_fields_search] index level setting.

All of these limitations make it hard to avoid using _source.
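To make limitations 3 to 5 concrete, here is a sketch of what a client would have to do to get around them: expand the wildcard itself, skip text and container fields, and split the result into batches that fit under the docvalue limit. This is a hypothetical workaround, not something proposed in this thread. `expand_docvalue_pattern` is an invented helper, the batch size of 100 comes from the error message above, and it assumes every remaining field actually has doc values, which depends on the mapping.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# Types skipped here: text fields trigger the fielddata error, containers hold no values.
SKIPPED_TYPES = {"text", "object", "nested"}


def expand_docvalue_pattern(
    pattern: str,
    field_types: Dict[str, str],
    batch_size: int = 100,
) -> List[List[str]]:
    """Expand `pattern` against known fields and batch the matches.

    Each returned batch could be sent as the docvalue_fields of one request.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    matches = sorted(
        name
        for name, field_type in field_types.items()
        if field_type not in SKIPPED_TYPES and fnmatchcase(name, pattern)
    )
    return [matches[i:i + batch_size] for i in range(0, len(matches), batch_size)]
```

With roughly 2588 doc-value fields, as in the metricbeat example, this would mean 26 separate requests per page of hits, which is clearly not a practical answer.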

@jtibshirani
Contributor

I caught up with @jimczi offline to clarify our earlier discussion. Instead of immediately pushing ahead with the source_path (#49264) and formatters changes, we'd like to step back and consider the problem in a more end-to-end way. That way, we can consider a coordinated API change that addresses the use case in a more direct + user-friendly way.

We can continue the discussion about field retrieval on this issue, building on @wylieconlon's helpful analysis. I'll remove 'team discuss' for now, but we can add it back if there's a particular item we'd like to discuss in person.

@jpountz
Contributor

jpountz commented Mar 6, 2020

+1 to move forward with something along the lines of @wylieconlon's above proposal.

@jtibshirani
Contributor

Great, I've assigned this to myself and am working on a design doc. Once the design is more settled I'll post it here or open a new meta-issue.

@jtibshirani
Contributor

jtibshirani commented Apr 17, 2020

I opened a meta-issue to track implementation details: #55363.

jtibshirani added a commit that referenced this issue Jul 27, 2020
…60100)

This feature adds a new `fields` parameter to the search request, which
consults both the document `_source` and the mappings to fetch fields in a
consistent way. The PR merges the `field-retrieval` feature branch.

Addresses #49028 and #55363.
jtibshirani added a commit that referenced this issue Jul 28, 2020
…60258)

This feature adds a new `fields` parameter to the search request, which
consults both the document `_source` and the mappings to fetch fields in a
consistent way. The PR merges the `field-retrieval` feature branch.

Addresses #49028 and #55363.
@jtibshirani
Contributor

Closing, since the feature branch was merged in #60100.
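For readers landing here later, a request using the merged `fields` parameter might be built like the sketch below. The request shape follows how the parameter is documented in released Elasticsearch versions, as far as I know. The field names are only illustrative, and `fields_search_body` is a hypothetical helper rather than part of any client library.

```python
import json
from typing import Any, Dict, List, Union

# A field spec is either a name/pattern or an object with "field" and "format".
FieldSpec = Union[str, Dict[str, str]]


def fields_search_body(fields: List[FieldSpec], size: int = 10) -> Dict[str, Any]:
    """Build a search body that retrieves values only through `fields`."""
    return {
        "query": {"match_all": {}},
        "_source": False,  # rely on `fields` instead of the raw source
        "fields": fields,
        "size": size,
    }


body = fields_search_body(
    ["system.cpu.*", {"field": "@timestamp", "format": "epoch_millis"}]
)
print(json.dumps(body, indent=2))
```

The resulting JSON can be sent as the body of a `_search` request with any HTTP client.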
