Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update code comment for the cases of regularized RANGE frame and add tests for ORDER BY cases with RANGE frame #8410

Merged
merged 5 commits into from
Dec 4, 2023

Conversation

viirya
Copy link
Member

@viirya viirya commented Dec 3, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

We have regularized RANGE window frame to ROWS frame if ORDER BY clause is empty under special cases (based on the comment), but the code actually ignores non-empty ORDER BY clause which implicitly allows more than one ORDER BY columns. This case is not reflected in the code comment. This updates code comment to make it more clear.

This also adds a few tests in sqllogictest for various ORDER BY cases with RANGE frame. By comparing their results with Postgres, there are a few TODOs added for the difference.

What changes are included in this PR?

Updated code comment and added tests.

Are these changes tested?

Added tests to sqllogictest.

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Dec 3, 2023
// 1. start bound is UNBOUNDED or CURRENT ROW
// 2. end bound is CURRENT ROW or UNBOUNDED.
// In these cases, we regularize the RANGE frame to be equivalent to a ROWS
// frame with the UNBOUNDED bounds.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that actually Spark doesn't allow empty ORDER BY clause for RANGE frame in any cases. I updated this according to the comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So DataFusion is following Postgres behavior on this.

postgres=# select a, rank() over (partition by a RANGE between 1 PRECEDING AND 2 FOLLOWING) rnk from (select 1 a) q;
ERROR:  RANGE with offset PRECEDING/FOLLOWING requires exactly one ORDER BY column
LINE 1: select a, rank() over (partition by a RANGE between 1 PRECED...
                              ^
postgres=# select a, rank() over (partition by a RANGE between UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rnk from (select 1 a) q;
 a | rnk 
---+-----
 1 |   1
(1 row)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

postgres=# select a, rank() over (partition by a RANGE between 1 PRECEDING AND 2 FOLLOWING) rnk from (select 1 a) q;
ERROR:  RANGE with offset PRECEDING/FOLLOWING requires exactly one ORDER BY column
LINE 1: select a, rank() over (partition by a RANGE between 1 PRECED...
                              ^
postgres=# select a, rank() over (partition by a order by a RANGE between 1 PRECEDING AND 2 FOLLOWING) rnk from (select 1 a) q;
 a | rnk 
---+-----
 1 |   1
(1 row)

postgres=# select a, rank() over (partition by a order by a, a + 1 RANGE between 1 PRECEDING AND 2 FOLLOWING) rnk from (select 1 a) q;
ERROR:  RANGE with offset PRECEDING/FOLLOWING requires exactly one ORDER BY column
LINE 1: select a, rank() over (partition by a order by a, a + 1 RANG...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, actually Postgres behavior allows empty ORDER BY clause, one column or multiple column ORDER BY clause with RANGE frame if start bound/end bound is UNBOUNDED or CURRENT ROW.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In short, DataFusion currently allows various ORDER BY clause cases (empty, 1 or more column) with RANGE frame, similar to Postgres.

But for empty ORDER BY clause case, the window function result is different to Postgres.

See added tests.

Comment on lines +163 to +166
// If an ORDER BY clause is absent, the frame is equivalent to a ROWS
// frame with the UNBOUNDED bounds.
// If an ORDER BY clause is present but has more than one column, the
// frame is unchanged.
Copy link
Member Author

@viirya viirya Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, based on Postgres behavior, it doesn't change the RANGE frame to ROWS frame (see the added tests in sqllogictest which return different results than Postgres).

Based on Postgres doc, without ORDER BY, Postgres treats all rows of the partition as peers of the current row (https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS).

@viirya viirya changed the title fix: RANGE frame can be regularized to ROWS frame only if empty ORDER BY clause doc: Update code comment for the cases of regularized RANGE frame Dec 4, 2023
@viirya viirya changed the title doc: Update code comment for the cases of regularized RANGE frame Update code comment for the cases of regularized RANGE frame and add tests for ORDER BY cases with RANGE frame Dec 4, 2023
Comment on lines +3781 to +3787
# TODO: this is different to Postgres which returns [1, 1] for `rnk`.
query I
select rank() over (order by 1 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rnk
from (select 1 a union select 2 a) q ORDER BY rnk
----
1
2
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a fix for the issue locally. But this PR focuses on doc and test change, so I will submit the fix after this is merged.

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch and great job! Thank you @viirya.

This PR has been very carefully described and explained

@viirya viirya merged commit 0bcf462 into apache:main Dec 4, 2023
25 checks passed
@viirya
Copy link
Member Author

viirya commented Dec 4, 2023

Thanks @jackwener

appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Dec 15, 2023
…tests for ORDER BY cases with RANGE frame (apache#8410)

* fix: RANGE frame can be regularized to ROWS frame only if empty ORDER BY clause

* Fix flaky test

* Update test comment

* Add code comment

* Update
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants