executor: add new agg function `APPROX_COUNT_DISTINCT` (#17175) #18120

ti-srebot · 2020-06-18T14:02:23Z

cherry-pick #17175 to release-4.0

Signed-off-by: Tong Zhigao <[email protected]>

What problem does this PR solve?

Issue Number: close #14632

Problem Summary:

Distinct count is very slow and can consume a large amount of memory.
If a relative error is acceptable, we can use a sampling algorithm to compute an approximate result.

What is changed and how it works?

Add a new agg function `APPROX_COUNT_DISTINCT`.
Use the BJKST algorithm to compute an approximate distinct count.
For the calculation state, it keeps a sample of element hash values with a size of up to 2^16. Compared with the widely known HyperLogLog algorithm, this algorithm is less effective in terms of accuracy and memory consumption (even up to proportionality), but it is adaptive: at fairly high accuracy, it consumes less memory when cardinality is computed simultaneously for a large number of data sets whose cardinalities follow a power-law distribution (that is, when most of the data sets are small). It is also very accurate for data sets with small cardinality and very efficient on CPU.
For TiFlash, TiDB can push down the cop request and merge all partial results. For other engines, TiDB needs to collect all the original data and compute the result itself.

Tests

Unit test

Release note

Add a new agg function `APPROX_COUNT_DISTINCT` to support approximate distinct counting.

Signed-off-by: ti-srebot <[email protected]>

ti-srebot · 2020-06-18T14:02:24Z

/run-all-tests

ti-srebot · 2020-06-18T14:02:33Z

@solotzg please accept the invitation then you can push to the cherry-pick pull requests.
https://github.com/ti-srebot/tidb/invitations

Signed-off-by: Tong Zhigao <[email protected]>

solotzg · 2020-06-18T15:35:10Z

/run-all-tests

solotzg · 2020-06-18T15:53:49Z

/run-all-tests

solotzg · 2020-06-18T16:22:44Z

/run-all-tests

lzmhhh123

LGTM

solotzg · 2020-06-19T01:44:15Z

/run-all-tests

XuHuaiyu

LGTM

XuHuaiyu · 2020-06-19T02:42:52Z

/merge

ti-srebot · 2020-06-19T02:42:56Z

Sorry @XuHuaiyu, you don't have permission to trigger auto merge event on this branch.

XuHuaiyu · 2020-06-19T02:45:51Z

/run-all-tests

solotzg · 2020-06-19T03:27:06Z

/run-all-tests

solotzg · 2020-06-19T03:47:54Z

/run-all-tests

cherry pick pingcap#17175 to release-4.0

db7aecd

Signed-off-by: ti-srebot <[email protected]>

ti-srebot mentioned this pull request Jun 18, 2020

executor: add new agg function APPROX_COUNT_DISTINCT #17175

Merged

ti-srebot added the sig/execution, component/expression, and type/4.0-cherry-pick labels Jun 18, 2020

ti-srebot requested review from lzmhhh123, winoros, wshwsh12 and XuHuaiyu June 18, 2020 14:02

ti-srebot added this to the v4.0.2 milestone Jun 18, 2020

ti-srebot assigned solotzg Jun 18, 2020

solotzg added 2 commits June 18, 2020 23:03

fix conflict

f9c5fa6

Signed-off-by: Tong Zhigao <[email protected]>

fix conflict

0b6857f

Signed-off-by: Tong Zhigao <[email protected]>

lzmhhh123 reviewed Jun 19, 2020

lzmhhh123 added the status/LGT1 label Jun 19, 2020

Merge branch 'release-4.0' into release-4.0-978370f7cbd3

41b4d1a

XuHuaiyu approved these changes Jun 19, 2020

XuHuaiyu added the status/LGT2 label and removed the status/LGT1 label Jun 19, 2020

XuHuaiyu merged commit 6c2a572 into pingcap:release-4.0 Jun 19, 2020

XuHuaiyu deleted the release-4.0-978370f7cbd3 branch June 19, 2020 03:57
