Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up DistinctCountAccumulator #5472

Closed
Dandandan opened this issue Mar 3, 2023 · 10 comments
Closed

Speed up DistinctCountAccumulator #5472

Dandandan opened this issue Mar 3, 2023 · 10 comments
Labels
enhancement New feature or request performance Make DataFusion faster

Comments

@Dandandan
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently the code uses a HashSet<ScalarValue> to track unique keys, which is slow and uses more memory than needed.

Describe the solution you'd like
Use a typed array for storing the entries & keep a hashmap.
We can use the same approach as present in the dictionary builders (PrimitiveDictionaryBuilder) or parquet Interner contributed by @tustvold.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@Dandandan Dandandan added enhancement New feature or request performance Make DataFusion faster labels Mar 3, 2023
@tustvold
Copy link
Contributor

tustvold commented Mar 3, 2023

I'm not intimately familiar with this operator, but if it needs to support tuples or nested types the row format might be an option. You can implement the operator using rows and convert to OwnedRow to store in a HashSet

@Dandandan
Copy link
Contributor Author

FYI @comphead For dictionary implementation see e.g. https://docs.rs/arrow-array/34.0.0/src/arrow_array/builder/primitive_dictionary_builder.rs.html#82

@comphead
Copy link
Contributor

comphead commented Mar 7, 2023

Hi @Dandandan I was crawling through the dictionary builder implementation.
It has 3-in-1 : keys array, values array and a hashmap, won't be that causing an overhead?

Please elaborate if we can try a simple HashSet<ArrowPrimitiveType> as an alternative?

@alamb
Copy link
Contributor

alamb commented Jul 17, 2023

I think we could make COUNT DISTINCT go much faster if implemented via HashSet<ArrowPrimitiveType>. If that was implemented as a GroupsAccumuluator I suspect COUNT DISTINCT would be relatively screaming

@Dandandan
Copy link
Contributor Author

I wonder how a GroupsAccumulator implementation could look like.
Make something like a HashSet<(u64, OwnedRow>) (index + row content) or HashSet<(u64, ArrowPrimitiveType>) for single rows / primitive type.

A way to make it even faster is to use create_hashes + memoize hashes (i.e. also store hash in table) and use the RawTable API.

Also there are probably more opportunities to rewrite queries (i.e. rewrite select distinct + join on the group by keys might work faster in some cases).

@alamb
Copy link
Contributor

alamb commented Jul 17, 2023

I was thinking something like the following (templated on primitive type):

struct CountDistinctGroupsAccumulator<T: ArrowPrimitiveType> {
  values: HashSet<T>,
}

Clearly we could then take it a step farther to use the hashbrown API to go even faster

Then use DataType::List as we do today to pass intermediate state

Also there are probably more opportunities to rewrite queries (i.e. rewrite select distinct + join on the group by keys might work faster in some cases).

that is also quite interesting

@Dandandan
Copy link
Contributor Author

I tried to prototype a solution.

I think we need more than just storing a HashSet as the distinct count needs to be distinct per group index not just overall number of distinct values (that's why the original has to use a HashMap per accumulator). It needs to keep track of the unique values per group index and produce them (as list) to compute the the final count per group index.

I think an efficient way to store the data for primitives could be the following:

PrimitiveDistinctCountGroupsAccumulator<T: ArrowPrimitiveType> {
    // stores unique group index + index of value in `values` for this group index
    uniques: RawTable<(usize, usize)>,
    /// stores the unique values
    values: Vec<T::Native>,
    /// Stores the index (+1) of the first value for index (0 is end of list)
    /// The value is stored in`values`, the next (possible) index in `next`
    /// First item for group index is stored at group index
    first: Vec<usize>,
    /// Stores the location (+1) of the next value (0 is end of list)
    next: Vec<usize>,
    /// Track nulls in the input / filters
    null_state: NullState,
    /// stores number of groups (to produce state)
    total_num_groups: usize,
    /// random state
    random_state: RandomState
}

uniques stores the group id + index to the value.
We can use a similar datastructure as in the hash join to store the data in a chained list datastructure in a single Vec.
This makes it possible to produce a list array for each group index.

@alamb
Copy link
Contributor

alamb commented Jul 24, 2023

Storing chained values seems like a neat idea -- if you plan to use it for the Join maybe we could make it its own data structure and reuse in both places. 🤔

@alamb
Copy link
Contributor

alamb commented Jan 31, 2024

FWIW we made great progress in #8849 and #8721

The remaining types that might be important are Binary/LargeBinary but I would be inclined to close this ticket as complete now that we have fast count distinct for integers and strings

@yjshen
Copy link
Member

yjshen commented Feb 29, 2024

Since #8827 has been merged(including the support for both Binary/LargeBinary), I think it's okay to close this one.

@yjshen yjshen closed this as completed Feb 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance Make DataFusion faster
Projects
None yet
Development

No branches or pull requests

5 participants