executor: add new agg function APPROX_COUNT_DISTINCT (#17175) #18120
Signed-off-by: ti-srebot <[email protected]>
/run-all-tests
@solotzg please accept the invitation then you can push to the cherry-pick pull requests.
Signed-off-by: Tong Zhigao <[email protected]>
Signed-off-by: Tong Zhigao <[email protected]>
/run-all-tests
LGTM
/run-all-tests
LGTM
/merge
Sorry @XuHuaiyu, you don't have permission to trigger auto merge event on this branch.
/run-all-tests
cherry-pick #17175 to release-4.0
Signed-off-by: Tong Zhigao [email protected]
What problem does this PR solve?
Issue Number: close #14632
Problem Summary:
What is changed and how it works?
Add a new aggregate function, APPROX_COUNT_DISTINCT, which uses the BJKST algorithm to compute an approximate distinct count.
For its calculation state, the function keeps a sample of element hash values with a size of up to 2^16. Compared with the widely known HyperLogLog algorithm, BJKST is less effective in terms of accuracy and memory consumption (even up to proportionality), but it is adaptive: at fairly high accuracy, it consumes less memory when cardinalities are computed simultaneously for a large number of data sets whose cardinalities follow a power-law distribution (that is, when most of the data sets are small). The algorithm is also very accurate for data sets with small cardinality and very efficient on CPU.
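The adaptive sampling idea can be illustrated with a minimal Go sketch. This is not the code from this PR: the names `sketch`, `insert`, `estimate`, and `mix64` are hypothetical, and `mix64` (a splitmix64-style finalizer) is only a stand-in for whatever hash the real implementation uses. The sketch keeps hash values whose low `level` bits are zero, raises `level` whenever more than 2^16 hashes are retained, and estimates the distinct count as the sample size times 2^level.

```go
package main

import "fmt"

// maxSize caps the number of retained hashes (2^16, as in the description).
const maxSize = 1 << 16

// sketch is a hypothetical, simplified BJKST-style state.
type sketch struct {
	level  uint                // only hashes with `level` low zero bits are kept
	hashes map[uint64]struct{} // the current sample of hash values
}

func newSketch() *sketch {
	return &sketch{hashes: make(map[uint64]struct{})}
}

// keeps reports whether h passes the current sampling filter.
func (s *sketch) keeps(h uint64) bool {
	mask := (uint64(1) << s.level) - 1
	return h&mask == 0
}

// insert adds a hash value and halves the sampling rate while the sample is too big.
func (s *sketch) insert(h uint64) {
	if !s.keeps(h) {
		return
	}
	s.hashes[h] = struct{}{}
	for len(s.hashes) > maxSize {
		s.level++
		for k := range s.hashes {
			if !s.keeps(k) {
				delete(s.hashes, k) // deleting during range is allowed in Go
			}
		}
	}
}

// estimate scales the sample size by the inverse sampling rate.
func (s *sketch) estimate() uint64 {
	return uint64(len(s.hashes)) << s.level
}

// mix64 is an assumed stand-in hash (splitmix64-style finalizer), not TiDB's.
func mix64(x uint64) uint64 {
	x ^= x >> 30
	x *= 0xbf58476d1ce4e5b9
	x ^= x >> 27
	x *= 0x94d049bb133111eb
	x ^= x >> 31
	return x
}

func main() {
	s := newSketch()
	for i := uint64(0); i < 200000; i++ {
		s.insert(mix64(i))
	}
	fmt.Println("level:", s.level, "sample:", len(s.hashes), "estimate:", s.estimate())
}
```

While fewer than 2^16 distinct hashes have been seen, `level` stays at 0 and the count is exact up to hash collisions, which matches the claim above about small data sets.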
For TiFlash, TiDB can push the aggregation down in a cop request and merge all the partial results. For other storage engines, TiDB needs to collect all the original data and compute the result itself.
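Merging partial results works because two sampled states can be combined without the original rows. The Go sketch below shows one way this could look; it is an illustration under assumptions, not the executor code from this PR. The type `partialState` and its methods are hypothetical names, and the multiplicative hash in `main` is only a placeholder.

```go
package main

import "fmt"

// maxSize caps the number of retained hashes per state (2^16).
const maxSize = 1 << 16

// partialState is a hypothetical partial result, e.g. one per pushed-down task.
type partialState struct {
	level  uint
	hashes map[uint64]struct{}
}

func newPartial() *partialState {
	return &partialState{hashes: make(map[uint64]struct{})}
}

// keeps reports whether h has `level` low zero bits.
func (p *partialState) keeps(h uint64) bool {
	return h&((uint64(1)<<p.level)-1) == 0
}

// prune drops hashes that no longer pass the filter at the current level.
func (p *partialState) prune() {
	for h := range p.hashes {
		if !p.keeps(h) {
			delete(p.hashes, h)
		}
	}
}

// add inserts one hash and raises the level while the sample exceeds the cap.
func (p *partialState) add(h uint64) {
	if !p.keeps(h) {
		return
	}
	p.hashes[h] = struct{}{}
	for len(p.hashes) > maxSize {
		p.level++
		p.prune()
	}
}

// mergeFrom folds another partial state into p: adopt the higher level,
// re-filter the local sample, then add the other side's hashes.
func (p *partialState) mergeFrom(o *partialState) {
	if o.level > p.level {
		p.level = o.level
		p.prune()
	}
	for h := range o.hashes {
		p.add(h)
	}
}

func (p *partialState) estimate() uint64 {
	return uint64(len(p.hashes)) << p.level
}

func main() {
	// Placeholder hash: multiplying by an odd constant is a bijection on uint64.
	hash := func(i uint64) uint64 { return i * 0x9e3779b97f4a7c15 }

	a, b := newPartial(), newPartial()
	for i := uint64(0); i < 600; i++ {
		a.add(hash(i)) // values 0..599
	}
	for i := uint64(400); i < 1000; i++ {
		b.add(hash(i)) // values 400..999, overlapping with a
	}
	a.mergeFrom(b)
	fmt.Println("merged estimate:", a.estimate())
}
```

Because each state stores hash values rather than counts, values seen by both partial states are deduplicated on merge, so the overlap between the two inputs is not counted twice.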
Tests
Release note
Add APPROX_COUNT_DISTINCT to support approximate count distinct.