Apply workaround for #5444 to `DataFrame::describe` #5468

jiangzhx · 2023-03-03T14:46:48Z

Which issue does this PR close?

Apply a patch while Expr.alias function not work with count aggregation #5444 is not fixed
use the alltypes_tiny_pages.parquet file as a data test set.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb

Thank you @jiangzhx -- looks like an improvement to me. Maybe @Jefffrey also has some time to review this PR as they were commenting on #5444

alamb · 2023-03-03T22:03:24Z

datafusion/core/src/dataframe.rs

-                .aggregate(
-                    vec![],
-                    fields_iter
+            vec![RecordBatch::try_new(


I don't fully understand this change, but it seems ok.

This doesn't fix the issue 5444, but it did lead me to discover a better way to speed up the operation of describe.

Jefffrey · 2023-03-04T00:52:24Z

I don't fully understand these changes either, will need to take a closer look as I'm not yet familiar with describe(...), but it seems this PR is a workaround to the actual issue of #5444 and wouldn't resolve it

jiangzhx · 2023-03-04T01:26:09Z

I don't fully understand these changes either, will need to take a closer look as I'm not yet familiar with describe(...), but it seems this PR is a workaround to the actual issue of #5444 and wouldn't resolve it

yes,#5444 need find other way to fix. it's should not to close .
but #5444 lead me to discover a better way to speed up the operation of describe.

Jefffrey · 2023-03-04T01:30:49Z

Is the method of speeding up referring to being able to leverage aggregate_statistics physical optimizer if possible to be able to get the stats faster? Then you hit the issue in #5444 due to these lines in describe(...):

https://github.com/apache/arrow-datafusion/blob/e9852074bacd8c891d84eba38b3417aa16a2d18c/datafusion/core/src/dataframe.rs#L441-L454

Where can't find the column by name since the name of the count(...) column is wrong hence goes to the None case. So this is a workaround to make it work?

Is my understanding correct @jiangzhx ?

jiangzhx · 2023-03-04T02:03:47Z

Is the method of speeding up referring to being able to leverage aggregate_statistics physical optimizer if possible to be able to get the stats faster? Then you hit the issue in #5444 due to these lines in describe(...):

https://github.com/apache/arrow-datafusion/blob/e9852074bacd8c891d84eba38b3417aa16a2d18c/datafusion/core/src/dataframe.rs#L441-L454

Where can't find the column by name since the name of the count(...) column is wrong hence goes to the None case. So this is a workaround to make it work?

Is my understanding correct @jiangzhx ?

yes, you are right.

the problem was caused by this function
called Result::unwrap()on anErr value: ArrowError(InvalidArgumentError("It is not possible to concatenate arrays of different data types."))

https://github.com/apache/arrow-datafusion/blob/e9852074bacd8c891d84eba38b3417aa16a2d18c/datafusion/core/src/dataframe.rs#L457-L463

Jefffrey · 2023-03-04T02:23:59Z

Ok I understand now, thanks.

My opinion is to not have this workaround unless its urgently needed (since this would fix a bug in describe(...) which otherwise can't be avoided at the moment I think), as it adds a bit of complexity to the describe(...) method which would become redundant when the core issue #5444 is resolved (in which case this workaround wouldn't be needed anymore), if things work out right.

But if can be diligent to remove this workaround when #5444 is fixed (and confirmed to also fix this without workaround) then it seems fine.

jiangzhx · 2023-03-04T03:59:32Z

Ok I understand now, thanks.

My opinion is to not have this workaround unless its urgently needed (since this would fix a bug in describe(...) which otherwise can't be avoided at the moment I think), as it adds a bit of complexity to the describe(...) method which would become redundant when the core issue #5444 is resolved (in which case this workaround wouldn't be needed anymore), if things work out right.

But if can be diligent to remove this workaround when #5444 is fixed (and confirmed to also fix this without workaround) then it seems fine.

Thanks for your suggestion @Jefffrey . How about this solution? It might be easier to understand.

//count aggregation
let cnt = self.clone().aggregate(
    vec![],
    original_schema_fields
        .clone()
        .map(|f| count(col(f.name())))
        .collect::<Vec<_>>(),
)?;
let cnt = cnt
    .clone()
    .select(
        cnt.schema()
            .fields()
            .iter()
            .zip(original_schema_fields.clone())
            .map(|(count_field, orgin_field)| {
                col(count_field.name()).alias(orgin_field.name())
            })
            .collect::<Vec<_>>(),
    )
    .unwrap();
let count_record_batch = cnt.collect().await.unwrap()；

Jefffrey · 2023-03-04T10:07:58Z

@jiangzhx That certainly is a more straightforward workaround, if you decide to go with it then having a comment explaining why the workaround is there would be helpful too

jiangzhx · 2023-03-04T11:27:11Z

@jiangzhx That certainly is a more straightforward workaround, if you decide to go with it then having a comment explaining why the workaround is there would be helpful too

The issue has been resolved and the fix implemented was based on your suggestion.
thank you @Jefffrey .

alamb · 2023-03-04T11:33:28Z

datafusion/core/src/dataframe.rs

+        // The optimization of AggregateStatistics will rewrite the physical plan
+        // for the count function and ignore alias functions,
+        // as shown in https://github.com/apache/arrow-datafusion/issues/5444.
+        // This logic should be removed when #5444 is fixed.


👍 -- thank you this comment makes it much clearer what is going on

datafusion/core/src/dataframe.rs

alamb

LGTM -- thank you @jiangzhx

Jefffrey

Looks good, just some minor nitpicks

datafusion/core/src/dataframe.rs

alamb · 2023-03-06T17:04:50Z

Thanks @jiangzhx and thanks @Jefffrey for the review 🙏

ursabot · 2023-03-06T17:07:01Z

Benchmark runs are scheduled for baseline = d0bd28e and contender = 99ef989. 99ef989 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions bot added the core Core DataFusion crate label Mar 3, 2023

alamb changed the title ~~use table stats for descrbe method count result~~ use table stats for DataFrame::describe method count result Mar 3, 2023

alamb reviewed Mar 3, 2023

View reviewed changes

jiangzhx force-pushed the issues/5444 branch from 81456f5 to db13c12 Compare March 4, 2023 11:24

alamb reviewed Mar 4, 2023

View reviewed changes

datafusion/core/src/dataframe.rs Outdated Show resolved Hide resolved

jiangzhx force-pushed the issues/5444 branch from db13c12 to bb290e9 Compare March 4, 2023 11:37

jiangzhx changed the title ~~use table stats for DataFrame::describe method count result~~ Apply a patch while Expr.alias function not work with count aggregation #5444 is not fixed Mar 4, 2023

jiangzhx changed the title ~~Apply a patch while Expr.alias function not work with count aggregation #5444 is not fixed~~ Apply a patch while #5444 is not fixed Mar 4, 2023

jiangzhx requested a review from alamb March 4, 2023 12:55

alamb changed the title ~~Apply a patch while #5444 is not fixed~~ Apply workaround for DataFrame::describe while #5444 is not fixed Mar 4, 2023

alamb changed the title ~~Apply workaround for DataFrame::describe while #5444 is not fixed~~ Apply workaround for #5444 to DataFrame::describe Mar 4, 2023

alamb approved these changes Mar 4, 2023

View reviewed changes

Jefffrey approved these changes Mar 4, 2023

View reviewed changes

datafusion/core/src/dataframe.rs Show resolved Hide resolved

datafusion/core/src/dataframe.rs Outdated Show resolved Hide resolved

datafusion/core/src/dataframe.rs Outdated Show resolved Hide resolved

Apply workaround for apache#5444 to DataFrame::describe

01aeb59

jiangzhx force-pushed the issues/5444 branch from bb290e9 to 01aeb59 Compare March 5, 2023 04:32

alamb merged commit 99ef989 into apache:main Mar 6, 2023

Jefffrey mentioned this pull request Mar 11, 2023

Revert describe count() workaround #5556

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Apply workaround for #5444 to `DataFrame::describe` #5468

Apply workaround for #5444 to `DataFrame::describe` #5468

jiangzhx commented Mar 3, 2023 •

edited

Loading

alamb left a comment

alamb Mar 3, 2023

jiangzhx Mar 4, 2023

Jefffrey commented Mar 4, 2023

jiangzhx commented Mar 4, 2023

Jefffrey commented Mar 4, 2023

jiangzhx commented Mar 4, 2023

Jefffrey commented Mar 4, 2023

jiangzhx commented Mar 4, 2023 •

edited

Loading

Jefffrey commented Mar 4, 2023

jiangzhx commented Mar 4, 2023

alamb Mar 4, 2023

alamb left a comment

Jefffrey left a comment

alamb commented Mar 6, 2023

ursabot commented Mar 6, 2023

Apply workaround for #5444 to DataFrame::describe #5468

Apply workaround for #5444 to DataFrame::describe #5468

Conversation

jiangzhx commented Mar 3, 2023 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb left a comment

Choose a reason for hiding this comment

alamb Mar 3, 2023

Choose a reason for hiding this comment

jiangzhx Mar 4, 2023

Choose a reason for hiding this comment

Jefffrey commented Mar 4, 2023

jiangzhx commented Mar 4, 2023

Jefffrey commented Mar 4, 2023

jiangzhx commented Mar 4, 2023

Jefffrey commented Mar 4, 2023

jiangzhx commented Mar 4, 2023 • edited Loading

Jefffrey commented Mar 4, 2023

jiangzhx commented Mar 4, 2023

alamb Mar 4, 2023

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

Jefffrey left a comment

Choose a reason for hiding this comment

alamb commented Mar 6, 2023

ursabot commented Mar 6, 2023

Apply workaround for #5444 to `DataFrame::describe` #5468

Apply workaround for #5444 to `DataFrame::describe` #5468

jiangzhx commented Mar 3, 2023 •

edited

Loading

jiangzhx commented Mar 4, 2023 •

edited

Loading