Support a keyed histogram #100242

Closed
danielmitterdorfer opened this issue Oct 4, 2023 · 2 comments · Fixed by #101826
Labels: :Analytics/Aggregations, >enhancement, Team:Analytics

Comments

@danielmitterdorfer
Member

Description

Given documents with the following structure (note that the event_ids property can contain duplicate keys, each occurrence of which must be counted in the aggregation):

{
  "id": "f7c173cdb16c743d",
  "event_ids": ["a", "b", "c", "a", "a"]
},
{
  "id": "a82173cdb16cbba7",
  "event_ids": ["d", "a", "b", "b", "a"]
}

we'd like to be able to aggregate the data as follows:

[
  {
    "key": "a",
    "count": 5
  },
  {
    "key": "b",
    "count": 3
  },
  {
    "key": "c",
    "count": 1
  },
  {
    "key": "d",
    "count": 1
  }
]

You could essentially call this a "keyed histogram".
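The requested result can be sketched outside Elasticsearch as a plain occurrence count over all documents. This is only an illustration of the desired semantics, not of how an aggregation would be implemented; the function name `keyed_histogram` and the tie-breaking by key are assumptions made for the example.

```python
from collections import Counter

# The two example documents from the description above.
docs = [
    {"id": "f7c173cdb16c743d", "event_ids": ["a", "b", "c", "a", "a"]},
    {"id": "a82173cdb16cbba7", "event_ids": ["d", "a", "b", "b", "a"]},
]


def keyed_histogram(documents):
    """Count every occurrence of every key, including duplicates within a document."""
    counts = Counter()
    for doc in documents:
        counts.update(doc["event_ids"])
    # Order by descending count; ties are broken by key (an assumption of this sketch).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "count": count} for key, count in ordered]


histogram = keyed_histogram(docs)
print(histogram)
```

A count based on documents rather than occurrences would report 2 for "a" here, because a document is counted once no matter how often it contains the key.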

Note that there is some flexibility in what the event_ids property looks like. For example, the following structure would also be possible (the example is limited to the event_ids property of the first document above):

[
  {
    "key": "a",
    "count": 3
  },
  {
    "key": "b",
    "count": 1
  },
  {
    "key": "c",
    "count": 1
  }
]
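As a rough sketch, a client could produce this pre-counted form from the raw list before indexing, and the aggregation over it then becomes a sum of the per-document counts. The helper names `to_counted` and `sum_counted` are hypothetical and exist only for this example.

```python
from collections import Counter


def to_counted(event_ids):
    """Collapse a list with duplicates into key/count pairs (first-seen order)."""
    return [{"key": key, "count": count} for key, count in Counter(event_ids).items()]


def sum_counted(counted_docs):
    """Aggregate several pre-counted lists by summing the counts per key."""
    totals = Counter()
    for counted in counted_docs:
        for entry in counted:
            totals[entry["key"]] += entry["count"]
    return dict(totals)


first = to_counted(["a", "b", "c", "a", "a"])
second = to_counted(["d", "a", "b", "b", "a"])
totals = sum_counted([first, second])
print(first)
print(totals)
```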

Note: This is somewhat related to the idea presented in #61550 (comment).

/cc: @martijnvg

@elasticsearchmachine added the Team:Analytics label Oct 4, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@danielmitterdorfer
Member Author

@martijnvg came up with the idea to add a dedicated field type which encodes the values in a way that keeps track of duplicates (this will make the format less efficient). The new field mapper type will be similar to a keyword field but maintains two doc values fields: one for the values and one for the number of occurrences (which will often be 1, but is greater than 1 for duplicates). The field data instance returned by this field should use the occurrence doc values field to emit the duplicates, so aggregations can take duplicated values into account.

With this new field type in place, we can then just use the terms aggregation to retrieve data in the desired format.
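The idea of two parallel doc values fields can be modelled in a few lines. The following is a conceptual sketch only: the class `CountedDocValues` and the functions `encode` and `emit_with_duplicates` are invented for illustration and do not reflect the actual Lucene or Elasticsearch doc values encoding.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class CountedDocValues:
    """Per-document model: unique values plus a parallel list of occurrence counts."""
    values: List[str]
    counts: List[int]


def encode(raw_values: List[str]) -> CountedDocValues:
    """Store each distinct value once (sorted) and remember how often it occurred."""
    occurrences = Counter(raw_values)
    keys = sorted(occurrences)
    return CountedDocValues(values=keys, counts=[occurrences[k] for k in keys])


def emit_with_duplicates(doc_values: CountedDocValues) -> Iterator[str]:
    """Replay every value as many times as it occurred, as a consumer would see it."""
    for value, count in zip(doc_values.values, doc_values.counts):
        for _ in range(count):
            yield value


encoded = encode(["a", "b", "c", "a", "a"])
replayed = list(emit_with_duplicates(encoded))
print(encoded)
print(replayed)
```

Because the duplicates are replayed to the consumer, a counting aggregation that tallies every emitted value sees "a" three times for this document.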

@danielmitterdorfer self-assigned this Oct 13, 2023
danielmitterdorfer added a commit that referenced this issue Nov 15, 2023
With this commit we add support for "keyed histograms". This is achieved by 
adding a new mapping type `counted_keyword` and the corresponding 
`counted_terms` aggregation. The new mapping type keeps track of the actual
counts of duplicate values in multi-valued fields, and the aggregation takes
them into account.

Example:

```
PUT /test
{
  "mappings": {
    "properties": {
      "event_ids": { "type": "counted_keyword" }
    }
  }
}

POST /test/_doc
{
  "event_ids": ["a", "b", "c", "a", "a"]
}

GET /test/_search
{
  "aggs": {
    "events": {
      "counted_terms": { "field": "event_ids" }
    }
  }
}
```

This results in buckets that consider duplicates:

```json
{
  "aggregations": {
    "events": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "a",
          "doc_count": 3
        },
        {
          "key": "c",
          "doc_count": 1
        },
        {
          "key": "b",
          "doc_count": 1
        }
      ]
    }
  }
}
```
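On the client side, such a response can be turned into a key-to-count mapping. The sketch below works on the response body shown above as a literal string; the helper name `bucket_counts` is hypothetical, and no Elasticsearch client library is assumed.

```python
import json

# Response body copied from the example above.
response_body = """
{
  "aggregations": {
    "events": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {"key": "a", "doc_count": 3},
        {"key": "c", "doc_count": 1},
        {"key": "b", "doc_count": 1}
      ]
    }
  }
}
"""


def bucket_counts(response, agg_name):
    """Map each bucket key to its doc_count for the named aggregation."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


counts = bucket_counts(json.loads(response_body), "events")
print(counts)
```

Note that only one document was indexed in the example, so `doc_count` of 3 for "a" reflects the number of occurrences rather than the number of documents. The sketch uses a mapping so that it does not depend on the order of buckets with equal counts.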

We were not able to notice a difference in latency compared to a regular `terms`
aggregation. The overhead in disk space for the new field type (compared to 
`keyword`) is caused by an additional internal field that tracks value counts 
and was ~3.5% in a scenario with 8 unique keys with a total cardinality of 200.

Both the mapping type and the aggregation are considered experimental and thus
lack user documentation. Once we are confident enough to expose this at least
as a technical preview feature we will add documentation in a follow-up PR.

Closes #100242

---------

Co-authored-by: Martijn van Groningen <[email protected]>