Feature Request
Is your feature request related to a problem? Please describe:
The reservoir sampling collects too many wasted samples.
We're using this one
We need to make sure that each sub-collector collects the same number of samples as the root one when we are in the distributed case.
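For reference, a minimal sketch of textbook reservoir sampling (Algorithm R) follows. It is not necessarily the implementation linked above; the function name `reservoir_sample` and its parameters are hypothetical. It shows that a collector asked for `k` samples returns a full reservoir of `k` rows whenever its input has at least `k` rows, which is what each sub-collector does in the distributed case.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    rows: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of up to k rows (Algorithm R)."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# A scaled-down "region" of 1,000 rows with a reservoir of 100.
region_sample = reservoir_sample(range(1_000), k=100, rng=random.Random(0))
print(len(region_sample))
```

Every sub-collector that runs this with the root's `k` ships `k` rows upstream, regardless of how small its share of the table is.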
So when we are collecting 100K samples, we need to collect 100K samples from each region, and each region has about 1 million rows with the default options. This means that if we want to collect 100K samples for a table with 1 billion rows (the ideal sample rate here is 10^5/10^9 = 10^-4 = 0.01%), we actually collect 10^5/10^6 * 10^9 = 10^8 samples (the sample rate here is 10%).
That is 0.01% versus 10%: we collect 1000 times more samples than we need, which is a huge waste.
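The arithmetic above can be checked with a short script. The variable names are illustrative; the figures (10^5 target samples, 10^9 rows, about 10^6 rows per region) are the ones used in the calculation above.

```python
target_samples = 10**5    # samples wanted at the root collector
table_rows = 10**9        # rows in the table
rows_per_region = 10**6   # approximate rows per region

regions = table_rows // rows_per_region

# Ideal: sample the whole table once at the target size.
ideal_rate = target_samples / table_rows

# Actual: every region's sub-collector returns a full reservoir.
collected = regions * min(target_samples, rows_per_region)
actual_rate = collected / table_rows

# How many times more samples are collected than needed.
overhead = collected // target_samples

print(f"regions:     {regions}")
print(f"ideal rate:  {ideal_rate:.2%}")
print(f"collected:   {collected}")
print(f"actual rate: {actual_rate:.0%}")
print(f"overhead:    {overhead}x")
```

Under these assumptions the overhead factor equals the number of regions, so the waste grows with the table size.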
Describe the feature you'd like:
We need a better sampling algorithm that does not waste so many samples. The wasted samples increase memory, CPU, and network cost.
Describe alternatives you've considered:
Teachability, Documentation, Adoption, Migration Strategy: