Support a keyed histogram #100242

danielmitterdorfer · 2023-10-04T07:52:21Z

Description

Given documents with the following structure (note that the event_ids property can contain duplicate keys which must be considered in the aggregation):

{
  "id": "f7c173cdb16c743d",
  "event_ids": ["a", "b", "c", "a", "a"]
},
{
  "id": "a82173cdb16cbba7",
  "event_ids": ["d", "a", "b", "b", "a"]
}

we'd like to be able to aggregate the data as follows:

[
  {
    "key": "a",
    "count": 5
  },
  {
    "key": "b",
    "count": 3
  },
  {
    "key": "c",
    "count": 1
  },
  {
    "key": "d",
    "count": 1
  }
]

You could essentially call this a "keyed histogram".
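To make the requested semantics concrete, here is a minimal client-side sketch in Python. It is not how Elasticsearch computes the result, and the function name `keyed_histogram` is hypothetical. It only illustrates that every occurrence of a key counts, including duplicates within a single document.

```python
from collections import Counter

# The two example documents from the description above.
docs = [
    {"id": "f7c173cdb16c743d", "event_ids": ["a", "b", "c", "a", "a"]},
    {"id": "a82173cdb16cbba7", "event_ids": ["d", "a", "b", "b", "a"]},
]


def keyed_histogram(documents):
    """Count every occurrence of every key across all documents.

    Hypothetical helper for illustration only: duplicates inside one
    document are counted individually, not collapsed to a single hit.
    """
    tally = Counter()
    for doc in documents:
        tally.update(doc["event_ids"])
    # Order by descending count, then by key, to get a stable result.
    ordered = sorted(tally.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "count": count} for key, count in ordered]


result = keyed_histogram(docs)
```

A regular `terms` aggregation on a `keyword` field would instead count documents per key, so `a` would be reported with a count of 2 rather than 5.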

Note that there is some flexibility in what the event_ids property looks like. For example, the following structure would also be possible (the example is limited to the event_ids property of the first document above):

[
  {
    "key": "a",
    "count": 3
  },
  {
    "key": "b",
    "count": 1
  },
  {
    "key": "c",
    "count": 1
  }
]

Note: This is somewhat related to the idea presented in #61550 (comment).

/cc: @martijnvg

elasticsearchmachine · 2023-10-04T07:52:46Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

danielmitterdorfer · 2023-10-06T08:59:42Z

@martijnvg came up with the idea to add a dedicated field type which encodes the values in a way that keeps track of duplicates (this will make the format less efficient). The new field mapper type will be similar to a keyword field but will maintain two doc values fields: one for the values and one for the number of occurrences (which will often be 1, but greater than 1 for duplicates). The field data instance returned by this field should use the occurrence doc values field to emit the duplicates, so aggregations can take duplicated values into account.

With this new field type in place, we can then just use the terms aggregation to retrieve data in the desired format.
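The idea can be sketched conceptually in Python as follows. This is an assumption-laden model, not the actual Elasticsearch or Lucene implementation: `CountedValues`, `encode`, `emit` and `terms` are hypothetical names, and the two tuples merely stand in for the two doc values fields described above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class CountedValues:
    """Hypothetical per-document representation: two parallel sequences."""
    values: Tuple[str, ...]  # unique values, sorted
    counts: Tuple[int, ...]  # counts[i] is the number of occurrences of values[i]


def encode(event_ids: List[str]) -> CountedValues:
    """Collapse duplicates into unique values plus their occurrence counts."""
    tally = Counter(event_ids)
    keys = sorted(tally)
    return CountedValues(tuple(keys), tuple(tally[k] for k in keys))


def emit(doc: CountedValues) -> Iterator[str]:
    """Replay each value as often as it occurred, as the field data would."""
    for value, count in zip(doc.values, doc.counts):
        for _ in range(count):
            yield value


def terms(docs: Iterable[CountedValues]) -> Dict[str, int]:
    """Stand-in for an aggregation that simply counts the values it sees."""
    buckets = Counter()
    for doc in docs:
        buckets.update(emit(doc))
    return dict(buckets)


first = encode(["a", "b", "c", "a", "a"])
second = encode(["d", "a", "b", "b", "a"])
buckets = terms([first, second])
```

Because the duplicates are re-emitted by the field itself, the aggregation in this model needs no knowledge of the count encoding.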

With this commit we add support for "keyed histograms". This is achieved by adding a new mapping type `counted_keyword` and the corresponding `counted_terms` aggregation. The new mapping type keeps track of actual counts of duplicate multi-valued fields and the aggregation considers them.

Example:

```
PUT /test
{
  "mappings": {
    "properties": {
      "event_ids": {
        "type": "counted_keyword"
      }
    }
  }
}

POST /test/_doc
{
  "event_ids": ["a", "b", "c", "a", "a"]
}

GET /test/_search
{
  "aggs": {
    "events": {
      "counted_terms": {
        "field": "event_ids"
      }
    }
  }
}
```

This results in buckets that consider duplicates:

```json
{
  "aggregations": {
    "events": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "a",
          "doc_count": 3
        },
        {
          "key": "c",
          "doc_count": 1
        },
        {
          "key": "b",
          "doc_count": 1
        }
      ]
    }
  }
}
```

We were not able to notice a difference in latency compared to a regular `terms` aggregation. The overhead in disk space for the new field type (compared to `keyword`) is caused by an additional internal field that tracks value counts and was ~3.5% in a scenario with 8 unique keys with a total cardinality of 200.

Both the mapping type and the aggregation are considered experimental and thus lack user documentation. Once we are confident enough to expose this at least as a technical preview feature we will add documentation in a follow-up PR.

Closes #100242

---------

Co-authored-by: Martijn van Groningen <[email protected]>

danielmitterdorfer added >enhancement :Analytics/Aggregations Aggregations labels Oct 4, 2023

elasticsearchmachine added the Team:Analytics label (Meta label for analytical engine team (ESQL/Aggs/Geo)) Oct 4, 2023

danielmitterdorfer self-assigned this Oct 13, 2023

danielmitterdorfer mentioned this issue Nov 6, 2023

Support keyed histograms #101826

Merged

danielmitterdorfer closed this as completed in #101826 Nov 15, 2023
