Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support min max aggregates in window functions with sliding windows #4675

Merged
merged 14 commits into from Dec 22, 2022
Merged

Support min max aggregates in window functions with sliding windows #4675

merged 14 commits into from Dec 22, 2022

Conversation

ghost
Copy link

@ghost ghost commented Dec 19, 2022

Which issue does this PR close?

Closes #4603 and closes #4402.

Rationale for this change

This PR adds support for running MIN-MAX accumulators using custom window frames. Since adding support for sliding window for min-max introduces performance penalty, we have added ForwardAggregateWindowExpr, which is a special implementation for expressions that don't require retract. By using this struct, if not needed, we do not introduce performance penalty. As an example window expression, MIN(c12) OVER (ORDER BY C12 RANGE BETWEEN 0.3 PRECEDING AND 0.2 FOLLOWING) will use MinAccumulator to calculate its result (where implementation is sliding and introduces additional overhead). However, window expression MIN(c12) OVER (ORDER BY C12 RANGE BETWEEN UNBOUNDED PRECEDING AND 0.2 FOLLOWING) will use MinRowAccumulator which is a better implementation when retract is not a concern.

What changes are included in this PR?

update_batch and retract_batch methods of MIN, MAX accumulators are implemented. The algorithm and data structure used are described here. If an accumulator has a support for better implementation when retract is unnecessary, we now use optimized implementation during execution.

Are these changes tested?

Yes. aggregate_min_max_w_custom_window_frames test function is added.

Are there any user-facing changes?

No.

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions labels Dec 19, 2022
@ghost ghost closed this Dec 19, 2022
@ghost ghost reopened this Dec 19, 2022
@ghost ghost changed the title Feature/min max aggregate with custom windows min max aggregate with custom windows Dec 19, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @berkaycpp -- this looks like a good start

I have some performance concerns, which I described in detail.

I also think the code needs significantly more test coverage - the sql level MIN/MAX tests are a good start but I don't think they necessarily hit all the corner cases.

The sliding_window is a fascinating proposal

cc @Ted-Jiang

@mustafasrepo
Copy link
Contributor

Thanks @alamb for your feedbacks. We will address them as soon as possible. @Ted-Jiang if you have time, we would like to have your feedback also regarding this design.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the changes in datafusion/core/src/datasource/file_format/parquet.rs and datafusion/core/src/datasource/mod.rs I think this PR looks ready to go.

Thank you for your work on it

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @berkaycpp and @mustafasrepo

@alamb alamb changed the title min max aggregate with custom windows Support min max aggregates in window functions with sliding windows Dec 22, 2022
@alamb alamb merged commit afb1ae2 into apache:master Dec 22, 2022
@ursabot
Copy link

ursabot commented Dec 22, 2022

Benchmark runs are scheduled for baseline = 77991a3 and contender = afb1ae2. afb1ae2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java


// The implementation is taken from https://github.com/spebern/moving_min_max/blob/master/src/lib.rs.

//! Keep track of the minimum or maximum value in a sliding window.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an update here, I subsequently discovered a possible better algorithm for this work -- see #4904 for details

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions
Projects
None yet
3 participants