refactor count_distinct to not to have update and merge #5408

Weijun-H · 2023-02-26T23:37:53Z

Which issue does this PR close?

Part of #1598

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Dandandan · 2023-02-27T01:16:25Z

datafusion/physical-expr/src/aggregate/count_distinct.rs

+        let arr = &values[0];
+        (0..arr.len()).try_for_each(|index| {
+            let scalar = ScalarValue::try_from_array(arr, index)?;
+            let scalar = vec![scalar];


We can avoid creating a Vec here

Hello @Dandandan, thank you for reviewing my work. I'm currently having difficulty finding a way to avoid using vec in this context. Could you please provide me with some guidance on how to refactor it?

We can modify DistinctScalarValues from
struct DistinctScalarValues(Vec<ScalarValue>)
to
struct DistinctScalarValues(ScalarValue)

and update the code accordingly.

This might provide some small performance increase.
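A minimal sketch of the difference between the two shapes. It uses a hypothetical `Scalar` enum as a stand-in for DataFusion's `ScalarValue`, and the names `DistinctVec`, `DistinctOne`, `count_distinct_vec`, and `count_distinct_one` are made up for illustration. With the `Vec` wrapper each row builds a one-element vector before it is hashed; with the single-value wrapper it does not.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for DataFusion's `ScalarValue` (two variants only).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Scalar {
    Int64(Option<i64>),
    Utf8(Option<String>),
}

/// Old shape: one heap-allocated Vec per row, even for a single column.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct DistinctVec(Vec<Scalar>);

/// Proposed shape: the scalar is stored inline, with no per-row Vec.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct DistinctOne(Scalar);

fn count_distinct_vec(rows: &[Scalar]) -> usize {
    let mut seen = HashSet::new();
    for row in rows {
        // Allocates a one-element Vec for every row.
        seen.insert(DistinctVec(vec![row.clone()]));
    }
    seen.len()
}

fn count_distinct_one(rows: &[Scalar]) -> usize {
    let mut seen = HashSet::new();
    for row in rows {
        // Wraps the value directly; no extra allocation for the wrapper.
        seen.insert(DistinctOne(row.clone()));
    }
    seen.len()
}
```

Both functions return the same count; only the per-row allocation differs, which is where any performance gain would come from.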

Dandandan · 2023-02-27T01:16:56Z

datafusion/physical-expr/src/aggregate/count_distinct.rs

+        let arr = &states[0];
+        (0..arr.len()).try_for_each(|index| {
+            let scalar = ScalarValue::try_from_array(arr, index)?;
+            let scalar = vec![scalar];


We can rewrite the code to not create the Vec

Dandandan · 2023-02-27T10:40:44Z

datafusion/physical-expr/src/aggregate/count_distinct.rs

+        let arr = &values[0];
+        (0..arr.len()).try_for_each(|index| {
+            let scalar = ScalarValue::try_from_array(arr, index)?;
+            if !ScalarValue::is_null(&scalar) {


This check can be done on the array arr already instead of on the scalar.
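A rough sketch of checking nulls on the array before building a scalar. `Int64Column` is a hypothetical stand-in for an Arrow array (values plus a validity mask), and `update_batch` here is a simplified free function, not the accumulator method from the PR. Arrow arrays expose a similar `is_null(index)` check.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for an Arrow Int64 array.
struct Int64Column {
    values: Vec<i64>,
    /// valid[i] == false means row i is NULL.
    valid: Vec<bool>,
}

impl Int64Column {
    fn len(&self) -> usize {
        self.values.len()
    }

    fn is_null(&self, index: usize) -> bool {
        !self.valid[index]
    }
}

fn update_batch(seen: &mut HashSet<i64>, arr: &Int64Column) {
    for index in 0..arr.len() {
        // Ask the array whether the slot is null before materializing a value,
        // so null rows never pay for scalar construction.
        if arr.is_null(index) {
            continue;
        }
        seen.insert(arr.values[index]);
    }
}
```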

Dandandan · 2023-02-27T19:54:01Z

@Weijun-H The tests are failing, could you take a look?

Dandandan · 2023-02-27T19:56:39Z

datafusion/physical-expr/src/aggregate/count_distinct.rs

-                .collect::<Result<Vec<_>>>()?;
-            self.merge(&v)
-        })
-    }
    fn state(&self) -> Result<Vec<ScalarValue>> {
        let mut cols_out = self
            .state_data_types


state_data_types can be simplified to be of type DataType too instead of Vec<DataType>

(and the code updated accordingly)

> state_data_types can be simplified to be of type DataType too instead of Vec<DataType>

Do you mean state_data_types in DistinctCountAccumulator and DistinctCount?

Yes, that doesn't need to be a Vec, as it always uses a single state column. This will simplify the code some more.

Weijun-H · 2023-02-27T22:44:17Z

@Dandandan, I reviewed the failed test and identified the root cause of the issue. It appears that the problem occurred because there was a modification made to the structure of the DistinctScalarValues from DistinctScalarValues(Vec<ScalarValue>) to DistinctScalarValues(ScalarValue). This change caused the test to fail because the test was designed to work with two different types of data.

https://github.com/apache/arrow-datafusion/blob/16cb4c122f8ea110bc7adf425f4905fa06ed2c81/datafusion/physical-expr/src/aggregate/count_distinct.rs#L608-L642

Dandandan · 2023-02-28T08:22:38Z

> @Dandandan, I reviewed the failed test and identified the root cause of the issue. It appears that the problem occurred because there was a modification made to the structure of the DistinctScalarValues from DistinctScalarValues(Vec<ScalarValue>) to DistinctScalarValues(ScalarValue). This change caused the test to fail because the test was designed to work with two different types of data.
>
> https://github.com/apache/arrow-datafusion/blob/16cb4c122f8ea110bc7adf425f4905fa06ed2c81/datafusion/physical-expr/src/aggregate/count_distinct.rs#L608-L642

Ok. I think we have to update (or remove) the tests so that their expectations reflect that we no longer support multiple columns.

Dandandan · 2023-02-28T09:32:55Z

datafusion/physical-expr/src/aggregate/count_distinct.rs

@@ -41,7 +41,7 @@ pub struct DistinctCount {
    /// The DataType for the final count
    data_type: DataType,
    /// The DataType used to hold the state for each input
-    state_data_types: Vec<DataType>,
+    state_data_types: DataType,
    /// The input arguments
    exprs: Vec<Arc<dyn PhysicalExpr>>,


I think we should also update exprs: Vec<Arc<dyn PhysicalExpr>> to exprs: Arc<dyn PhysicalExpr> and DistinctCount::new to communicate that multiple columns are no longer supported.

Maybe input_data_types: Vec<DataType> also needs to be changed to DataType, since it only needs one DataType?

Yeah, as well.
If you'd like to do some more refactoring, you could also remove data_type: DataType from the DistinctCount struct and just keep an int64 in the DistinctCountAccumulator instead of using a ScalarValue + count_data_type.
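A sketch of what the single-column shape could look like after these suggestions. `DataType`, `PhysicalExpr`, and `Column` below are hypothetical stand-ins, not the DataFusion types, and the accumulator stores plain `i64` values purely to keep the example short. It is meant to show the shape of the structs, not the merged code.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for Arrow's `DataType`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

/// Hypothetical stand-in for the `PhysicalExpr` trait.
trait PhysicalExpr {
    fn name(&self) -> String;
}

struct Column(String);

impl PhysicalExpr for Column {
    fn name(&self) -> String {
        self.0.clone()
    }
}

/// Single-column shape: one state type and one expression, no Vecs.
struct DistinctCount {
    state_data_type: DataType,
    expr: Arc<dyn PhysicalExpr>,
}

impl DistinctCount {
    /// Taking a single type and a single expression makes it clear from the
    /// signature that multiple columns are not supported.
    fn new(input_data_type: DataType, expr: Arc<dyn PhysicalExpr>) -> Self {
        Self {
            state_data_type: input_data_type,
            expr,
        }
    }
}

struct DistinctCountAccumulator {
    values: HashSet<i64>,
}

impl DistinctCountAccumulator {
    /// The count is always an integer, so it can be returned as i64 directly
    /// without carrying a separate count data type.
    fn evaluate(&self) -> i64 {
        self.values.len() as i64
    }
}
```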

Dandandan · 2023-03-01T13:27:44Z

datafusion/physical-expr/src/aggregate/count_distinct.rs

@@ -113,106 +102,61 @@ impl AggregateExpr for DistinctCount {
 #[derive(Debug)]
 struct DistinctCountAccumulator {
    values: HashSet<DistinctScalarValues, RandomState>,
-    state_data_types: Vec<DataType>,
+    state_data_types: DataType,
    count_data_type: DataType,


Can you remove this as well (it's basically unused by now)?

Dandandan · 2023-03-01T13:28:46Z

TY @Weijun-H I think it's looking great!
Can you fix the conflict and look at my remaining comment?

Dandandan · 2023-03-01T13:29:59Z

datafusion/physical-expr/src/aggregate/count_distinct.rs

@@ -31,35 +31,30 @@ use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::Accumulator;

 #[derive(Debug, PartialEq, Eq, Hash, Clone)]
-struct DistinctScalarValues(Vec<ScalarValue>);
+struct DistinctScalarValues(ScalarValue);


We could use HashSet<ScalarValue, RandomState> directly and remove this struct.
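One way to read this suggestion: once the wrapper holds a single value, the set can store the scalar type itself. The sketch below assumes a hypothetical `Scalar` enum in place of `ScalarValue` and uses the standard library's `RandomState`, which may not be the hasher the crate actually uses. The only requirement is that the element type implements `Eq` and `Hash`.

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashSet;

/// Hypothetical stand-in for DataFusion's `ScalarValue`.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Scalar {
    Int64(Option<i64>),
    Utf8(Option<String>),
}

/// The set stores the scalar directly; no newtype wrapper is needed.
struct DistinctCountAccumulator {
    values: HashSet<Scalar, RandomState>,
}

impl DistinctCountAccumulator {
    fn new() -> Self {
        Self {
            values: HashSet::with_hasher(RandomState::new()),
        }
    }

    /// Simplified single-value update; null filtering is left out here.
    fn update(&mut self, value: Scalar) {
        self.values.insert(value);
    }

    fn count(&self) -> i64 {
        self.values.len() as i64
    }
}
```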

Dandandan · 2023-03-01T14:35:56Z

datafusion/physical-expr/src/aggregate/count_distinct.rs

    /// The DataType used to hold the state for each input
-    state_data_types: Vec<DataType>,
+    state_data_types: DataType,


Suggested change

-    state_data_types: DataType,
+    state_data_type: DataType,

Dandandan · 2023-03-01T14:36:51Z

datafusion/physical-expr/src/aggregate/count_distinct.rs

    /// The input arguments
-    exprs: Vec<Arc<dyn PhysicalExpr>>,
+    exprs: Arc<dyn PhysicalExpr>,


Suggested change

-    exprs: Arc<dyn PhysicalExpr>,
+    expr: Arc<dyn PhysicalExpr>,

Dandandan · 2023-03-01T14:37:01Z

datafusion/physical-expr/src/aggregate/count_distinct.rs

-        input_data_types: Vec<DataType>,
-        exprs: Vec<Arc<dyn PhysicalExpr>>,
+        input_data_types: DataType,
+        exprs: Arc<dyn PhysicalExpr>,


Suggested change

-        exprs: Arc<dyn PhysicalExpr>,
+        expr: Arc<dyn PhysicalExpr>,

Dandandan · 2023-03-01T14:37:10Z

datafusion/physical-expr/src/aggregate/count_distinct.rs

 }

 impl DistinctCount {
    /// Create a new COUNT(DISTINCT) aggregate function.
    pub fn new(
-        input_data_types: Vec<DataType>,
-        exprs: Vec<Arc<dyn PhysicalExpr>>,
+        input_data_types: DataType,


Suggested change

-        input_data_types: DataType,
+        input_data_type: DataType,

Dandandan · 2023-03-01T14:37:38Z

datafusion/physical-expr/src/aggregate/count_distinct.rs

@@ -113,43 +99,10 @@ impl AggregateExpr for DistinctCount {
 #[derive(Debug)]
 struct DistinctCountAccumulator {
    values: HashSet<DistinctScalarValues, RandomState>,
-    state_data_types: Vec<DataType>,
-    count_data_type: DataType,
+    state_data_types: DataType,


Suggested change

-    state_data_types: DataType,
+    state_data_type: DataType,

Dandandan

Nice, I think this is a great step forward 👍

Weijun-H · 2023-03-01T18:08:59Z

> Nice, I think this is a great step forward 👍

Thank you for your patient guidance.

Dandandan · 2023-03-01T18:09:07Z

FYI @alamb this has some backwards-incompatible changes. I think it removes some unnecessary complexity, making it easier to improve performance later on.

Dandandan · 2023-03-02T12:30:37Z

Merging this in 24 hours if no other comments

alamb

Looks great to me -- thank you @Weijun-H and @Dandandan

ursabot · 2023-03-03T08:12:09Z

Benchmark runs are scheduled for baseline = ddd64e7 and contender = d11820a. d11820a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

comphead · 2023-03-08T17:51:42Z

datafusion/physical-expr/src/aggregate/count_distinct.rs

    }

    fn size(&self) -> usize {
-        if self.count_data_type.is_primitive() {


I think we lost the size support for variable-sized values during this PR. Is that expected? @alamb

It was not intended -- if someone has time to fix it that would be great, otherwise I will try and get a PR up later today

Filed #5534
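For context, a sketch of why variable-sized values need separate handling in a size estimate. `Scalar`, `heap_size`, and `estimated_size` are hypothetical names for illustration and do not reproduce the code that was removed or later restored. The idea is that a fixed-width value is fully covered by `size_of`, while a string also owns heap bytes that have to be added per value.

```rust
use std::collections::HashSet;
use std::mem::size_of;

/// Hypothetical stand-in for DataFusion's `ScalarValue`.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Scalar {
    Int64(Option<i64>),
    Utf8(Option<String>),
}

impl Scalar {
    /// Heap bytes owned by this value beyond `size_of::<Scalar>()`.
    fn heap_size(&self) -> usize {
        match self {
            Scalar::Utf8(Some(s)) => s.capacity(),
            _ => 0,
        }
    }
}

/// Rough estimate: inline slots for the set's capacity plus per-value heap data.
fn estimated_size(values: &HashSet<Scalar>) -> usize {
    let inline = values.capacity() * size_of::<Scalar>();
    let heap: usize = values.iter().map(Scalar::heap_size).sum();
    inline + heap
}
```

Skipping the per-value term would under-report memory for string columns, which is the kind of gap the comment above points at.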

github-actions bot added the physical-expr Physical Expressions label Feb 26, 2023

Dandandan reviewed Feb 27, 2023

Weijun-H force-pushed the refactor-count-distinct branch from 04eeb2b to 8158d6b Compare February 27, 2023 10:30

Dandandan reviewed Feb 27, 2023

Weijun-H force-pushed the refactor-count-distinct branch from 8158d6b to 16cb4c1 Compare February 27, 2023 13:02

Dandandan reviewed Feb 27, 2023

Weijun-H force-pushed the refactor-count-distinct branch from 16cb4c1 to 11798fc Compare February 28, 2023 09:26

Dandandan reviewed Feb 28, 2023

Weijun-H force-pushed the refactor-count-distinct branch from 11798fc to 2517567 Compare February 28, 2023 12:35

Dandandan reviewed Mar 1, 2023

Weijun-H force-pushed the refactor-count-distinct branch from 2517567 to a328f9c Compare March 1, 2023 14:30

Dandandan reviewed Mar 1, 2023

Weijun-H force-pushed the refactor-count-distinct branch from a328f9c to 98fcc97 Compare March 1, 2023 15:03

refactor count_distinct to not to have update and merge

fb4a17a

Weijun-H force-pushed the refactor-count-distinct branch from 98fcc97 to fb4a17a Compare March 1, 2023 17:58

Dandandan approved these changes Mar 1, 2023

alamb approved these changes Mar 2, 2023

Dandandan merged commit d11820a into apache:main Mar 3, 2023

comphead reviewed Mar 8, 2023

This was referenced Mar 9, 2023

revert accidently deleted size code in count_distinct #5533

Merged

Revert accidently deleted size code in count_distinct #5534

Closed

allenma mentioned this pull request Apr 10, 2023

Count distinct support multiple expressions #5939

Closed
