Introduce new sampling algorithm for statistics collecting #27357

Closed
winoros opened this issue Aug 18, 2021 · 1 comment
Labels: sig/planner, type/feature-request

Comments

winoros commented Aug 18, 2021

Feature Request

Is your feature request related to a problem? Please describe:

Reservoir sampling collects too many samples that end up wasted.
We're using this one.
In the distributed case, we need to make sure that each sub-collector collects the same number of samples as the root one.
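
To make the requirement above concrete, here is a minimal sketch of distributed uniform sampling by random keys. It is an illustration under assumptions, not TiDB's actual collector: the names `sample_region` and `merge_regions` are hypothetical, and rows are plain integers. Each region keeps the `k` rows with the largest random keys, and the root keeps the `k` largest keys overall. Because all `k` winners could come from a single region, every region has to ship a full `k` samples.

```python
import heapq
import random


def sample_region(rows, k, rng):
    """Sub-collector (hypothetical): keep the k rows with the largest random keys."""
    heap = []  # min-heap of (key, row); heap[0] is the smallest key kept so far
    for row in rows:
        key = rng.random()
        if len(heap) < k:
            heapq.heappush(heap, (key, row))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, row))
    return heap


def merge_regions(region_samples, k):
    """Root collector (hypothetical): keep the k largest keys across all regions."""
    merged = [item for sample in region_samples for item in sample]
    return heapq.nlargest(k, merged)


rng = random.Random(0)
k = 10
# Five toy "regions" of 1000 rows each.
regions = [range(i * 1000, (i + 1) * 1000) for i in range(5)]
per_region = [sample_region(rows, k, rng) for rows in regions]
final = merge_regions(per_region, k)

shipped = sum(len(sample) for sample in per_region)
print(f"shipped {shipped} samples to the root, kept {len(final)}")
```

In this toy setup the root receives `5 * k` samples and discards all but `k` of them; the discarded share grows with the number of regions.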

So when we want 100K samples in total, we need to collect 100K samples from each region. Each region holds about 1 million rows with the default settings. This means that if we want to collect 100K samples for a table with 1 billion rows (the ideal sample rate is 10^5/10^9 = 10^-4 = 0.01%), we actually collect 10^5/10^6 * 10^9 = 10^8 samples (an effective sample rate of 10%).

Compare 0.01% with 10%: that is a huge waste.
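
The arithmetic can be checked with a short script. This is only a sketch of the calculation in this issue, using 10^5 target samples, 10^6 rows per region, and 10^9 rows in the table; the function name `sampling_cost` and its parameters are made up for illustration.

```python
def sampling_cost(total_rows, rows_per_region, target_samples):
    """Compare the ideal sample rate with what per-region reservoirs ship."""
    # Ceiling division: a partially filled last region still counts as a region.
    regions = -(-total_rows // rows_per_region)
    # Each region ships a full reservoir (capped by the rows it holds).
    shipped = regions * min(target_samples, rows_per_region)
    return {
        "regions": regions,
        "ideal_rate": target_samples / total_rows,
        "shipped_samples": shipped,
        "effective_rate": shipped / total_rows,
        "overhead_factor": shipped // target_samples,
    }


cost = sampling_cost(total_rows=10**9, rows_per_region=10**6, target_samples=10**5)
for name, value in cost.items():
    print(f"{name}: {value}")
```

With these inputs there are 1000 regions, so the root receives 1000 times as many samples as it keeps.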

Describe the feature you'd like:

We need a better sampling algorithm that does not waste so many samples. The wasted samples increase memory, CPU, and network cost.

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

winoros added the type/feature-request label Aug 18, 2021
winoros self-assigned this Aug 18, 2021
winoros added the sig/planner label Aug 18, 2021

winoros commented Dec 22, 2023

implemented

Projects
Status: Finished