Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PRIMARY KEY Aggregate support to dataframe API #8356

Merged
merged 88 commits into from
Dec 8, 2023
Merged

Add PRIMARY KEY Aggregate support to dataframe API #8356

merged 88 commits into from
Dec 8, 2023

Conversation

mustafasrepo
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

Currently we can rewrite group by expressions according to primary key.
Consider query below

SELECT ts, sn, SUM(amount) as sum1
  FROM sales_global_with_pk
  GROUP BY sn

normally, after aggregation only sn and SUM(amount) as sum1 would be available for output. However, given sn is PRIMARY_KEY, we know that each sn value maps to a fixed ts value (e.g They are functionally depandant). Hence, by treating query above as

SELECT ts, sn, SUM(amount) as sum1
  FROM sales_global_with_pk
  GROUP BY ts, sn

under the hood, we can output ts also at the end. However, this feature was only available with SQL API. See discussion.

This PR brings this support to the dataframe API also. With this PR following dataframe query can be executed given col_id is PRIMARY_KEY.

        // group by contains id column
        let group_expr = vec![col_id.clone()];
        let aggr_expr = vec![];
        let df = df.aggregate(group_expr, aggr_expr)?;

        // expr list contains id, name (e.g projection)
        let expr_list = vec![col_id, col_name];
        let df = df.select(expr_list)?;

What changes are included in this PR?

Are these changes tested?

Yes, most of the changes comes from either plan tests, or .slt test

Are there any user-facing changes?

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Nov 29, 2023
@alamb
Copy link
Contributor

alamb commented Nov 30, 2023

I plan to review this PR tomorrow if no one beats me to it

ozankabak and others added 13 commits December 4, 2023 17:40
# Conflicts:
#	datafusion/optimizer/src/optimize_projections.rs
#	datafusion/optimizer/src/optimizer.rs
# Please enter a commit message to explain why this merge is necessary,
# especially if it merges an updated upstream into a topic branch.
#
# Lines starting with '#' will be ignored, and an empty message aborts
# the commit.
# Conflicts:
#	datafusion/optimizer/src/optimize_projections.rs
Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this very carefully (over 6 sittings as you can see in the commit history!) and it is almost ready to merge. It pays a longstanding tech debt where functional dependencies worked in SQL but not in dataframe API.

I asked @mustafasrepo to make one little optional fix and then it will be ready to go.

mustafasrepo and others added 2 commits December 8, 2023 14:34
# Conflicts:
#	datafusion/optimizer/src/optimize_projections.rs
@ozankabak
Copy link
Contributor

Since this has been around for a week, it doesn't make any potentially disruptive change, pays an old technical debt, and that I have already reviewed it very carefully; I will go ahead and merge this.

In case anything breaks, let us know and we will promptly fix it.

@ozankabak ozankabak merged commit 8f9d6e3 into apache:main Dec 8, 2023
22 checks passed
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Dec 15, 2023
* Aggregate rewrite for dataframe API.

* Simplifications

* Minor changes

* Minor changes

* Add new test

* Add new tests

* Minor changes

* Add rule, for aggregate simplification

* Simplifications

* Simplifications

* Simplifications

* Minor changes

* Simplifications

* Add new test condition

* Tmp

* Push requirement below aggregate

* Add join and subqeury alias

* Add cross join support

* Minor changes

* Add logical plan repartition support

* Add union support

* Add table scan

* Add limit

* Minor changes, buggy

* Add new tests, fix existing bugs

* change concat type array_concat

* Resolve some of the bugs

* Comment out a rule

* All tests pass, when single distinct is closed

* Fix aggregate bug

* Change analyze and explain implementations

* All tests pass

* Resolve linter errors

* Simplifications, remove unnecessary codes

* Comment out tests

* Remove pushdown projection

* Pushdown empty projections

* Fix failing tests

* Simplifications

* Update comments, simplifications

* Remove eliminate projection rule, Add method for group expr len aggregate

* Simplifications, subquery support

* Update comments, add unnest support, simplifications

* Remove eliminate projection pass

* Change name

* Minor changes

* Minor changes

* Add comments

* Fix failing test

* Minor simplifications

* update

* Minor

* Remove ordering

* Minor changes

* add merge projections

* Add comments, resolve linter errors

* Minor changes

* Minor changes

* Minor changes

* Minor changes

* Minor changes

* Minor changes

* Minor changes

* Minor changes

* Review Part 1

* Review Part 2

* Fix quadratic search, Change trim_expr impl

* Review Part 3

* Address reviews

* Minor changes

* Review Part 4

* Add case expr support

* Review Part 5

* Review Part 6

* Finishing touch: Improve comments

---------

Co-authored-by: berkaysynnada <[email protected]>
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants