Support a keyed histogram #100242
Comments
Pinging @elastic/es-analytics-geo (Team:Analytics)
@martijnvg came up with the idea to add a dedicated field type which encodes the values in a way that keeps track of duplicates (this will make the format less efficient). The new field mapper type will be similar to a [...]. With this new field type in place, we can then just use the terms aggregation to retrieve data in the desired format.
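To make the encoding idea concrete, here is a minimal sketch in Python of one way a multi-valued field could keep track of duplicates: store each distinct value once, plus a parallel list of occurrence counts. The function names `encode_counted` and `decode_counted` are hypothetical, and this is only an illustration of the concept, not the storage format Elasticsearch actually uses.

```python
from collections import Counter


def encode_counted(values):
    """Collapse a list with duplicates into (sorted unique keys, parallel counts).

    Hypothetical helper: illustrates the idea only, not the real on-disk format.
    """
    counts = Counter(values)
    keys = sorted(counts)
    return keys, [counts[k] for k in keys]


def decode_counted(keys, counts):
    """Expand (keys, counts) back into a sorted list that contains the duplicates."""
    return [k for k, n in zip(keys, counts) for _ in range(n)]


keys, counts = encode_counted(["a", "b", "c", "a", "a"])
print(keys)    # ['a', 'b', 'c']
print(counts)  # [3, 1, 1]
print(decode_counted(keys, counts))  # ['a', 'a', 'a', 'b', 'c']
```

The extra list of counts is the kind of overhead the comment refers to when it says the format becomes less efficient; the original ordering of the values is not preserved in this sketch.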
With this commit we add support for "keyed histograms". This is achieved by adding a new mapping type `counted_keyword` and the corresponding `counted_terms` aggregation. The new mapping type keeps track of actual counts of duplicate multi-valued fields and the aggregation considers them.

Example:

```
PUT /test
{
  "mappings": {
    "properties": {
      "event_ids": {
        "type": "counted_keyword"
      }
    }
  }
}

POST /test/_doc
{
  "event_ids": ["a", "b", "c", "a", "a"]
}

GET /test/_search
{
  "aggs": {
    "events": {
      "counted_terms": {
        "field": "event_ids"
      }
    }
  }
}
```

This results in buckets that consider duplicates:

```json
{
  "aggregations": {
    "events": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "a", "doc_count": 3 },
        { "key": "c", "doc_count": 1 },
        { "key": "b", "doc_count": 1 }
      ]
    }
  }
}
```

We were not able to notice a difference in latency compared to a regular `terms` aggregation. The overhead in disk space for the new field type (compared to `keyword`) is caused by an additional internal field that tracks value counts and was ~3.5% in a scenario with 8 unique keys with a total cardinality of 200.

Both the mapping type and the aggregation are considered experimental and thus lack user documentation. Once we are confident enough to expose this at least as a technical preview feature we will add documentation in a follow-up PR.

Closes #100242

---------

Co-authored-by: Martijn van Groningen <[email protected]>
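The difference between counting documents and counting occurrences can be illustrated with a small Python sketch. It only emulates the semantics described in the commit message on in-memory dictionaries; the function names `terms_agg` and `counted_terms_agg` are hypothetical and do not correspond to any Elasticsearch client API.

```python
from collections import Counter


def terms_agg(docs, field):
    """Emulate document-count semantics: a document counts once per distinct value."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.get(field, [])))
    return counts


def counted_terms_agg(docs, field):
    """Emulate duplicate-aware semantics: every occurrence of a value is counted."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.get(field, []))
    return counts


# Same document as in the commit message example.
docs = [{"event_ids": ["a", "b", "c", "a", "a"]}]

plain = terms_agg(docs, "event_ids")
counted = counted_terms_agg(docs, "event_ids")

print(plain["a"], plain["b"], plain["c"])        # 1 1 1
print(counted["a"], counted["b"], counted["c"])  # 3 1 1
```

With a single document, the document-count variant reports 1 for every key, whereas the duplicate-aware variant reports 3 for `a`, which matches the buckets shown in the example response.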
Description
Given documents with the following structure (note that the `event_ids` property can contain duplicate keys which must be considered in the aggregation):

we'd like to be able to aggregate the data as follows:
You could essentially call this a "keyed histogram".
Note that there is some flexibility in what the `event_ids` property should look like. For example, the following structure would also be possible (the example is limited to the `event_ids` property of the first document above):

Note: This is somewhat related to the idea presented in #61550 (comment).
/cc: @martijnvg