Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix window spec to support scalars #8402

Closed
wants to merge 0 commits into from
Closed

Conversation

comphead
Copy link
Contributor

@comphead comphead commented Dec 3, 2023

Which issue does this PR close?

Closes #8386.

Rationale for this change

The window expression spec fails if PARTITION BY or ORDER BY contains scalar, whereas such query considered as valid in PG

What changes are included in this PR?

Ignore scalars for PARTITION BY or ORDER BY as semantically it means nothing

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added sql SQL Planner optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Dec 3, 2023
2 1 1

# test window spec PARTITION BY/ORDER BY for scalar values should be ignored, only scalars in window spec
# known current issue with RANK with empty WINDOW spec. Fix test after RANK issue resolved
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fix it filed #8403

@comphead
Copy link
Contributor Author

comphead commented Dec 3, 2023

Filed #8403 to fix RANK function

Comment on lines 506 to 508
if col_type.is_numeric()
|| is_utf8_or_large_utf8(col_type)
|| matches!(col_type, DataType::Null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do added tests cover this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the test tests also nulls in window spec

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the null must be the first order by expression, does it?

Comment on lines 103 to 111
// ignore window spec ORDER BY for scalar values
.filter(|e| {
!matches!(
e,
OrderByExpr {
expr: sqlparser::ast::Expr::Value { .. },
..
}
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, as in coerce_window_frame the first order by expression's datatype is used to coerce start bound and end bound of Range window frame. If the only one order by expr is scalar, which is filtered out here, won't it return error?

if let Some(col_type) = current_types.first() {
    ...
} else {
    return internal_err!("ORDER BY column cannot be empty");
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its okay, it doesn't fail.... I tried to add a test with pure scalars but it is not stable because of other reasons. I'll decompose the test for now for single rows to prove it doesn't fail

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you test it? Did you use RANGE frame? It must be RANGE that triggers the error:

query error DataFusion error: Error during planning: RANGE requires exactly one ORDER BY column
select a,
       rank() over (partition by a order by 1 RANGE 10 PRECEDING) rnk
       from (select 1 a union all select 2 a) q

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initial tests are in the ticket description
#8386 (comment)

the small local test

❯ select rank() over (order by null) from (select 1 a union all select 2 a) q;

Error during planning: Sort operation is not applicable to scalar value NULL

In pg/duckdb

D select rank() over (order by null) from (select 1 a union all select 2 a) q;
┌─────────────────────────────┐
│ rank() OVER (ORDER BY NULL) │
│            int64            │
├─────────────────────────────┤
│                           1 │
│                           1 │
└─────────────────────────────┘

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I run it now with pure null, I'm getting another issue

select rank() over (order by null) from (select 1 a union all select 2 a) q;
Arrow error: Invalid argument error: number of columns(2) must match number of fields(1) in schema

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird, it works for

select rank() over (order by 1) from (select 1 a union all select 2 a) q;

but fails for

select rank() over (order by null) from (select 1 a union all select 2 a) q;

Digging

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mustafasrepo can I have your input please. The reason is the query fails

select rank() over (order by 1) rnk from (select 1 a union all select 2 a) x
Arrow error: Invalid argument error: number of columns(2) must match number of fields(1) in schema

The reason for that is optimize_projections removes the unused fields and this breaks the consistency. The question for you, what do you think is expected behavior for such rare case?

PS. If add a column name to projection or to ORDER BY it will expectedly work.

select a,
rank() over (partition by 10, 'a', null order by 20, 10, 'a', null) rnk,
row_number() over (partition by 10, 'a', null order by 20, 10, 'a', null) rn
from (select 1 a) q
Copy link
Contributor

@haohuaijin haohuaijin Dec 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can use order by a to make the result stable.

select a, 
       rank() over (partition by 10, a, 'a', null order by 20, a, 10, 'a', null) rnk,
       row_number() over (partition by 10, a, 'a', null order by 20, a, 10, 'a', null) rn
       from (select 2 a union all select 1 a) q order by a;

Copy link
Contributor Author

@comphead comphead Dec 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @haohuaijin 'll do. Before that I was confused (#8403 (comment)) and thought it is another bug so made a row level workaround. I was able to reproduce the same in DuckDB so this is expected, not a bug I'll fix the test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG]: windows frame don't support scalars in ORDER BY/PARTITION BY clauses
3 participants