Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update code comment for the cases of regularized RANGE frame and add tests for ORDER BY cases with RANGE frame #8410

Merged
merged 5 commits into from
Dec 4, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 11 additions & 1 deletion datafusion/expr/src/window_frame.rs
Original file line number Diff line number Diff line change
Expand Up @@ -148,12 +148,22 @@ impl WindowFrame {
pub fn regularize(mut frame: WindowFrame, order_bys: usize) -> Result<WindowFrame> {
if frame.units == WindowFrameUnits::Range && order_bys != 1 {
// Normally, RANGE frames require an ORDER BY clause with exactly one
// column. However, an ORDER BY clause may be absent in two edge cases.
// column. However, an ORDER BY clause may be absent or present but with
// more than one column in two edge cases:
// 1. start bound is UNBOUNDED or CURRENT ROW
// 2. end bound is CURRENT ROW or UNBOUNDED.
// In these cases, we regularize the RANGE frame to be equivalent to a ROWS
// frame with the UNBOUNDED bounds.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that actually Spark doesn't allow empty ORDER BY clause for RANGE frame in any cases. I updated this according to the comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So DataFusion is following Postgres behavior on this.

postgres=# select a, rank() over (partition by a RANGE between 1 PRECEDING AND 2 FOLLOWING) rnk from (select 1 a) q;
ERROR:  RANGE with offset PRECEDING/FOLLOWING requires exactly one ORDER BY column
LINE 1: select a, rank() over (partition by a RANGE between 1 PRECED...
                              ^
postgres=# select a, rank() over (partition by a RANGE between UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rnk from (select 1 a) q;
 a | rnk 
---+-----
 1 |   1
(1 row)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

postgres=# select a, rank() over (partition by a RANGE between 1 PRECEDING AND 2 FOLLOWING) rnk from (select 1 a) q;
ERROR:  RANGE with offset PRECEDING/FOLLOWING requires exactly one ORDER BY column
LINE 1: select a, rank() over (partition by a RANGE between 1 PRECED...
                              ^
postgres=# select a, rank() over (partition by a order by a RANGE between 1 PRECEDING AND 2 FOLLOWING) rnk from (select 1 a) q;
 a | rnk 
---+-----
 1 |   1
(1 row)

postgres=# select a, rank() over (partition by a order by a, a + 1 RANGE between 1 PRECEDING AND 2 FOLLOWING) rnk from (select 1 a) q;
ERROR:  RANGE with offset PRECEDING/FOLLOWING requires exactly one ORDER BY column
LINE 1: select a, rank() over (partition by a order by a, a + 1 RANG...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, actually Postgres behavior allows empty ORDER BY clause, one column or multiple column ORDER BY clause with RANGE frame if start bound/end bound is UNBOUNDED or CURRENT ROW.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In short, DataFusion currently allows various ORDER BY clause cases (empty, 1 or more column) with RANGE frame, similar to Postgres.

But for empty ORDER BY clause case, the window function result is different to Postgres.

See added tests.

// Note that this follows Postgres behavior.
if (frame.start_bound.is_unbounded()
|| frame.start_bound == WindowFrameBound::CurrentRow)
&& (frame.end_bound == WindowFrameBound::CurrentRow
|| frame.end_bound.is_unbounded())
{
// If an ORDER BY clause is absent, the frame is equivalent to a ROWS
// frame with the UNBOUNDED bounds.
// If an ORDER BY clause is present but has more than one column, the
// frame is unchanged.
Comment on lines +163 to +166
Copy link
Member Author

@viirya viirya Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, based on Postgres behavior, it doesn't change the RANGE frame to ROWS frame (see the added tests in sqllogictest which return different results than Postgres).

Based on Postgres doc, without ORDER BY, Postgres treats all rows of the partition as peers of the current row (https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS).

if order_bys == 0 {
frame.units = WindowFrameUnits::Rows;
frame.start_bound =
Expand Down
58 changes: 58 additions & 0 deletions datafusion/sqllogictest/test_files/window.slt
Original file line number Diff line number Diff line change
Expand Up @@ -3727,3 +3727,61 @@ FROM score_board s

statement ok
DROP TABLE score_board;

# Regularize RANGE frame
query error DataFusion error: Error during planning: RANGE requires exactly one ORDER BY column
select a,
rank() over (order by a, a + 1 RANGE BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) rnk
from (select 1 a union select 2 a) q ORDER BY a

query II
select a,
rank() over (order by a RANGE BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) rnk
from (select 1 a union select 2 a) q ORDER BY a
----
1 1
2 2

query error DataFusion error: Error during planning: RANGE requires exactly one ORDER BY column
select a,
rank() over (RANGE BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) rnk
from (select 1 a union select 2 a) q ORDER BY a

query II
select a,
rank() over (order by a, a + 1 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rnk
from (select 1 a union select 2 a) q ORDER BY a
----
1 1
2 2

query II
select a,
rank() over (order by a RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rnk
from (select 1 a union select 2 a) q ORDER BY a
----
1 1
2 2

# TODO: this is different to Postgres which returns [1, 1] for `rnk`.
# Comment it because it is flaky now as it depends on the order of the `a` column.
# query II
# select a,
# rank() over (RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rnk
# from (select 1 a union select 2 a) q ORDER BY rnk
# ----
# 1 1
# 2 2

# TODO: this works in Postgres which returns [1, 1].
query error DataFusion error: Arrow error: Invalid argument error: must either specify a row count or at least one column
select rank() over (RANGE between UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rnk
from (select 1 a union select 2 a) q;

# TODO: this is different to Postgres which returns [1, 1] for `rnk`.
query I
select rank() over (order by 1 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rnk
from (select 1 a union select 2 a) q ORDER BY rnk
----
1
2
Comment on lines +3781 to +3787
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a fix for the issue locally. But this PR focuses on doc and test change, so I will submit the fix after this is merged.