Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply workaround for #5444 to DataFrame::describe #5468

Merged
merged 1 commit into from
Mar 6, 2023

Conversation

jiangzhx
Copy link
Contributor

@jiangzhx jiangzhx commented Mar 3, 2023

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Mar 3, 2023
@alamb alamb changed the title use table stats for descrbe method count result use table stats for DataFrame::describe method count result Mar 3, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jiangzhx -- looks like an improvement to me. Maybe @Jefffrey also has some time to review this PR as they were commenting on #5444

.aggregate(
vec![],
fields_iter
vec![RecordBatch::try_new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand this change, but it seems ok.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't fix the issue 5444, but it did lead me to discover a better way to speed up the operation of describe.

@Jefffrey
Copy link
Contributor

Jefffrey commented Mar 4, 2023

I don't fully understand these changes either, will need to take a closer look as I'm not yet familiar with describe(...), but it seems this PR is a workaround to the actual issue of #5444 and wouldn't resolve it

@jiangzhx
Copy link
Contributor Author

jiangzhx commented Mar 4, 2023

I don't fully understand these changes either, will need to take a closer look as I'm not yet familiar with describe(...), but it seems this PR is a workaround to the actual issue of #5444 and wouldn't resolve it

yes,#5444 need find other way to fix. it's should not to close .
but #5444 lead me to discover a better way to speed up the operation of describe.

@Jefffrey
Copy link
Contributor

Jefffrey commented Mar 4, 2023

Is the method of speeding up referring to being able to leverage aggregate_statistics physical optimizer if possible to be able to get the stats faster? Then you hit the issue in #5444 due to these lines in describe(...):

https://github.com/apache/arrow-datafusion/blob/e9852074bacd8c891d84eba38b3417aa16a2d18c/datafusion/core/src/dataframe.rs#L441-L454

Where can't find the column by name since the name of the count(...) column is wrong hence goes to the None case. So this is a workaround to make it work?

Is my understanding correct @jiangzhx ?

@jiangzhx
Copy link
Contributor Author

jiangzhx commented Mar 4, 2023

Is the method of speeding up referring to being able to leverage aggregate_statistics physical optimizer if possible to be able to get the stats faster? Then you hit the issue in #5444 due to these lines in describe(...):

https://github.com/apache/arrow-datafusion/blob/e9852074bacd8c891d84eba38b3417aa16a2d18c/datafusion/core/src/dataframe.rs#L441-L454

Where can't find the column by name since the name of the count(...) column is wrong hence goes to the None case. So this is a workaround to make it work?

Is my understanding correct @jiangzhx ?

yes, you are right.

the problem was caused by this function
called Result::unwrap()on anErr value: ArrowError(InvalidArgumentError("It is not possible to concatenate arrays of different data types."))

https://github.com/apache/arrow-datafusion/blob/e9852074bacd8c891d84eba38b3417aa16a2d18c/datafusion/core/src/dataframe.rs#L457-L463

@Jefffrey
Copy link
Contributor

Jefffrey commented Mar 4, 2023

Ok I understand now, thanks.

My opinion is to not have this workaround unless its urgently needed (since this would fix a bug in describe(...) which otherwise can't be avoided at the moment I think), as it adds a bit of complexity to the describe(...) method which would become redundant when the core issue #5444 is resolved (in which case this workaround wouldn't be needed anymore), if things work out right.

But if can be diligent to remove this workaround when #5444 is fixed (and confirmed to also fix this without workaround) then it seems fine.

@jiangzhx
Copy link
Contributor Author

jiangzhx commented Mar 4, 2023

Ok I understand now, thanks.

My opinion is to not have this workaround unless its urgently needed (since this would fix a bug in describe(...) which otherwise can't be avoided at the moment I think), as it adds a bit of complexity to the describe(...) method which would become redundant when the core issue #5444 is resolved (in which case this workaround wouldn't be needed anymore), if things work out right.

But if can be diligent to remove this workaround when #5444 is fixed (and confirmed to also fix this without workaround) then it seems fine.

Thanks for your suggestion @Jefffrey . How about this solution? It might be easier to understand.

//count aggregation
let cnt = self.clone().aggregate(
    vec![],
    original_schema_fields
        .clone()
        .map(|f| count(col(f.name())))
        .collect::<Vec<_>>(),
)?;
let cnt = cnt
    .clone()
    .select(
        cnt.schema()
            .fields()
            .iter()
            .zip(original_schema_fields.clone())
            .map(|(count_field, orgin_field)| {
                col(count_field.name()).alias(orgin_field.name())
            })
            .collect::<Vec<_>>(),
    )
    .unwrap();
let count_record_batch = cnt.collect().await.unwrap();

@Jefffrey
Copy link
Contributor

Jefffrey commented Mar 4, 2023

@jiangzhx That certainly is a more straightforward workaround, if you decide to go with it then having a comment explaining why the workaround is there would be helpful too

@jiangzhx
Copy link
Contributor Author

jiangzhx commented Mar 4, 2023

@jiangzhx That certainly is a more straightforward workaround, if you decide to go with it then having a comment explaining why the workaround is there would be helpful too

The issue has been resolved and the fix implemented was based on your suggestion.
thank you @Jefffrey .

// The optimization of AggregateStatistics will rewrite the physical plan
// for the count function and ignore alias functions,
// as shown in https://github.com/apache/arrow-datafusion/issues/5444.
// This logic should be removed when #5444 is fixed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 -- thank you this comment makes it much clearer what is going on

@jiangzhx jiangzhx changed the title use table stats for DataFrame::describe method count result Apply a patch while Expr.alias function not work with count aggregation #5444 is not fixed Mar 4, 2023
@jiangzhx jiangzhx changed the title Apply a patch while Expr.alias function not work with count aggregation #5444 is not fixed Apply a patch while #5444 is not fixed Mar 4, 2023
@jiangzhx jiangzhx requested a review from alamb March 4, 2023 12:55
@alamb alamb changed the title Apply a patch while #5444 is not fixed Apply workaround for DataFrame::describe while #5444 is not fixed Mar 4, 2023
@alamb alamb changed the title Apply workaround for DataFrame::describe while #5444 is not fixed Apply workaround for #5444 to DataFrame::describe Mar 4, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- thank you @jiangzhx

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just some minor nitpicks

datafusion/core/src/dataframe.rs Show resolved Hide resolved
datafusion/core/src/dataframe.rs Outdated Show resolved Hide resolved
datafusion/core/src/dataframe.rs Outdated Show resolved Hide resolved
@alamb alamb merged commit 99ef989 into apache:main Mar 6, 2023
@alamb
Copy link
Contributor

alamb commented Mar 6, 2023

Thanks @jiangzhx and thanks @Jefffrey for the review 🙏

@ursabot
Copy link

ursabot commented Mar 6, 2023

Benchmark runs are scheduled for baseline = d0bd28e and contender = 99ef989. 99ef989 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants