Change required input ordering physical plan API to allow any NULLS FIRST / LAST and ASC / DESC #5772

Merged
merged 2 commits into from
Mar 30, 2023

Conversation

@mustafasrepo mustafasrepo commented Mar 29, 2023

Which issue does this PR close?

N.A

Note

The changes in this PR are taken from #5290 by @mingmwang.

Rationale for this change

For some executors, it is enough that the input is ordered; the direction of the ordering isn't important (for example, the PARTITION BY columns in WindowAggExec). The current required_input_ordering API, which returns Vec<Option<&[PhysicalSortExpr]>>, cannot express this subtlety, because PhysicalSortExpr is a struct that encapsulates expr: Arc<dyn PhysicalExpr> and options: SortOptions, so a direction is always specified.

What changes are included in this PR?

In this PR we change required_input_ordering
from fn required_input_ordering(&self) -> Vec<Option<&[PhysicalSortExpr]>>
to fn required_input_ordering(&self) -> Vec<Option<Vec<PhysicalSortRequirement>>>, where PhysicalSortRequirement is a struct that encapsulates expr: Arc<dyn PhysicalExpr> and options: Option<SortOptions>. If the options field is None, the executor expects its input to be ordered on expr, but the direction does not matter.
Some utility functions for converting between these structs are also added.
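
The difference between the two structs can be sketched with simplified stand-in types. This is a minimal illustration, not the DataFusion code: expressions are plain strings instead of Arc<dyn PhysicalExpr>, SortOptions mirrors only the descending and nulls_first fields, and the names SortExpr, SortRequirement and to_sort_expr are hypothetical.

```rust
// Simplified stand-ins for illustration only. The real types live in arrow
// (`SortOptions`) and DataFusion (`PhysicalSortExpr`, `PhysicalSortRequirement`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SortOptions {
    pub descending: bool,
    pub nulls_first: bool,
}

/// A concrete ordering: an expression plus a fixed direction and null placement.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortExpr {
    pub expr: String, // stand-in for Arc<dyn PhysicalExpr>
    pub options: SortOptions,
}

/// An ordering requirement. `options: None` means "sorted on `expr`, any direction".
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortRequirement {
    pub expr: String,
    pub options: Option<SortOptions>,
}

/// Hypothetical conversion: a concrete sort expression always maps to a
/// requirement with fixed options.
impl From<SortExpr> for SortRequirement {
    fn from(e: SortExpr) -> Self {
        SortRequirement {
            expr: e.expr,
            options: Some(e.options),
        }
    }
}

/// Hypothetical conversion in the other direction. When the requirement leaves
/// the options open, this sketch falls back to a caller-supplied default.
pub fn to_sort_expr(req: SortRequirement, default: SortOptions) -> SortExpr {
    SortExpr {
        expr: req.expr,
        options: req.options.unwrap_or(default),
    }
}
```

Converting a SortExpr into a SortRequirement never loses information, while the reverse direction has to pick some options when the requirement leaves them open; the default parameter here is just one way to make that choice explicit.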

Are these changes tested?

Existing tests should work.

Are there any user-facing changes?

api change

@github-actions github-actions bot added core Core DataFusion crate physical-expr Physical Expressions labels Mar 29, 2023
@mustafasrepo mustafasrepo changed the title Change required input ordering format Change required input ordering API Mar 29, 2023
@mingmwang
Contributor

LGTM

Comment on lines +191 to +214
pub(crate) fn calc_requirements(
    partition_by_exprs: &[Arc<dyn PhysicalExpr>],
    orderby_sort_exprs: &[PhysicalSortExpr],
) -> Option<Vec<PhysicalSortRequirement>> {
    let mut sort_reqs = vec![];
    for partition_by in partition_by_exprs {
        sort_reqs.push(PhysicalSortRequirement {
            expr: partition_by.clone(),
            options: None,
        });
    }
    for PhysicalSortExpr { expr, options } in orderby_sort_exprs {
        let contains = sort_reqs.iter().any(|e| expr.eq(&e.expr));
        if !contains {
            sort_reqs.push(PhysicalSortRequirement {
                expr: expr.clone(),
                options: Some(*options),
            });
        }
    }
    // Convert empty result to None. Otherwise wrap result inside Some()
    (!sort_reqs.is_empty()).then_some(sort_reqs)
}

Contributor

If there is overlap between partition by keys and sort keys, I think we should respect the sort key's requirements.

Contributor Author

In the case of PARTITION BY a, ORDER BY a, we do not add a separate ORDER BY requirement for column a. The reason is that each partition consists of rows with the same value of a, so ORDER BY doesn't really define a direction (all a values in a partition are the same). The operation produces correct results whether its input is ordered by a ASC or a DESC.

Contributor

But if the original SQL is PARTITION BY a, ORDER BY a DESC, should we respect the direction? I'm not sure about this.
I think that for cases like ROWNUM() OVER, the direction matters.

Contributor

Since ORDER BY is local to each partition, I think we are fine here. There is some related discussion here: https://stackoverflow.com/questions/50364818/using-the-same-column-in-partition-by-and-order-by-with-dense-rank

I am still taking a note to remind us that we may want to revisit this if we find out information indicating otherwise.

Contributor

Computing rownum() over PARTITION BY a, ORDER BY a DESC I think will result in arbitrary assignments of row numbers (as @ozankabak and @mustafasrepo say above, the value of a within each partition is the same)
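
The overlap case discussed in this thread can be reproduced with a small stand-alone version of calc_requirements. This is a sketch under simplifying assumptions: expressions are plain strings rather than Arc<dyn PhysicalExpr>, and SortOptions, SortExpr and SortRequirement are hypothetical stand-ins for the arrow and DataFusion types.

```rust
// Stand-in types for illustration; not the real arrow / DataFusion definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SortOptions {
    pub descending: bool,
    pub nulls_first: bool,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortExpr {
    pub expr: String,
    pub options: SortOptions,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortRequirement {
    pub expr: String,
    pub options: Option<SortOptions>,
}

/// Same logic as the function under review, on the simplified types:
/// PARTITION BY keys come first with open options, then every ORDER BY key
/// that is not already covered keeps its own options.
pub fn calc_requirements(
    partition_by: &[String],
    order_by: &[SortExpr],
) -> Option<Vec<SortRequirement>> {
    let mut reqs: Vec<SortRequirement> = partition_by
        .iter()
        .map(|expr| SortRequirement {
            expr: expr.clone(),
            options: None,
        })
        .collect();
    for SortExpr { expr, options } in order_by {
        // An ORDER BY key that is also a PARTITION BY key adds nothing new.
        if !reqs.iter().any(|r| &r.expr == expr) {
            reqs.push(SortRequirement {
                expr: expr.clone(),
                options: Some(*options),
            });
        }
    }
    // An empty requirement list is reported as "no requirement".
    (!reqs.is_empty()).then_some(reqs)
}
```

For PARTITION BY a, ORDER BY a DESC, b DESC, this yields a requirement on a with options None followed by a requirement on b with the DESC options, which is the behavior defended in the replies above.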

})
}
}

Contributor

👍

@alamb alamb added the api change Changes the API exposed to users of the crate label Mar 29, 2023
@alamb alamb changed the title Change required input ordering API Change required input ordering API to allow any NULLS FIRST / LAST and ASC / DESC Mar 29, 2023
Contributor

@alamb alamb left a comment

Thank you @mingmwang and @mustafasrepo

FWIW I will try and update IOx to use this new API shortly after the PR is merged. I found it challenging last time (see related discussion https://github.com/apache/arrow-datafusion/pull/5661/files#r1148410281) as the signature

Vec<Option<Vec<PhysicalSortRequirement>>>

imposes a significant cognitive load. However, given that this PR moves things forward and we don't have a competing alternative proposal, I think we should merge the API as is.

for partition_by in partition_by_exprs {
    sort_reqs.push(PhysicalSortRequirement {
        expr: partition_by.clone(),
        options: None,
Contributor

Is it correct that being able to express options: None for partitioning operations is the key rationale for this PR (something we couldn't do before)?

Contributor

Right. I vaguely remember us discussing internally that there are also some other use cases where some sort of ordering is required but the options don't matter, but I can't recall now

Contributor

Another example where the sort direction is not that important is SortMergeJoin, but it has an additional requirement: the orderings of all its inputs must be aligned (ASC or DESC together). I think the current PhysicalSortRequirement API cannot represent this alignment constraint explicitly.
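
To make the alignment constraint concrete, here is a hypothetical check on simplified stand-in types. It is not part of this PR or of DataFusion; the function name join_inputs_aligned and the string-based SortExpr are assumptions made purely for illustration.

```rust
// Stand-in types for illustration; not the real arrow / DataFusion definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SortOptions {
    pub descending: bool,
    pub nulls_first: bool,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortExpr {
    pub expr: String,
    pub options: SortOptions,
}

/// Hypothetical alignment check for a sort-merge style join: the i-th join key
/// of the left input must use the same sort options as the i-th join key of
/// the right input. The key expressions themselves may differ between sides.
pub fn join_inputs_aligned(left: &[SortExpr], right: &[SortExpr]) -> bool {
    left.len() == right.len()
        && left
            .iter()
            .zip(right)
            .all(|(l, r)| l.options == r.options)
}
```

Each side on its own would satisfy a per-input requirement with options: None regardless of direction, so a left input sorted ASC and a right input sorted DESC would both pass individually while failing this cross-input check.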

Contributor

The other example is SortBasedAggregator, although DataFusion does not currently support it.


/// Checks whether the given [`PhysicalSortRequirement`]s are satisfied by the
/// provided [`PhysicalSortExpr`]s.
pub fn ordering_satisfy_requirement_concrete<F: FnOnce() -> EquivalenceProperties>(
Contributor

Is this meant to be pub? It seems like callers should always use ordering_satisfy_requirement, right?

Contributor Author

@mustafasrepo mustafasrepo Mar 30, 2023

It is not strictly necessary for both ordering_satisfy_requirement and ordering_satisfy_requirement_concrete to be public. However, their APIs are different, and we may want to use both of them. In fact, sort_enforcement uses ordering_satisfy_requirement_concrete.
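
The core idea of checking a provided ordering against a requirement can be sketched as follows. This is a deliberately reduced illustration: it ignores the EquivalenceProperties closure that the real function takes, uses plain strings for expressions, and the name ordering_satisfies and the stand-in types are hypothetical.

```rust
// Stand-in types for illustration; not the real arrow / DataFusion definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SortOptions {
    pub descending: bool,
    pub nulls_first: bool,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortExpr {
    pub expr: String,
    pub options: SortOptions,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortRequirement {
    pub expr: String,
    pub options: Option<SortOptions>,
}

/// Returns true when `provided` starts with the required expressions in order
/// and every requirement with fixed options sees exactly those options.
/// Requirements with `options: None` accept any direction.
pub fn ordering_satisfies(provided: &[SortExpr], required: &[SortRequirement]) -> bool {
    if required.len() > provided.len() {
        return false;
    }
    required.iter().zip(provided).all(|(req, given)| {
        req.expr == given.expr && req.options.map_or(true, |opts| opts == given.options)
    })
}
```

With an input ordered by a DESC, b ASC, a requirement of "a in any direction" is met, while a requirement of "a in any direction, then b DESC" is not.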

@ozankabak
Contributor

Thanks @alamb -- I will create a separate follow-on that leverages type aliases to simplify things after this and its sister PR merges.

@alamb
Contributor

alamb commented Mar 29, 2023

I'll plan to merge this tomorrow unless anyone else would like time to comment

@alamb alamb changed the title Change required input ordering API to allow any NULLS FIRST / LAST and ASC / DESC Change required input ordering physical plan API to allow any NULLS FIRST / LAST and ASC / DESC Mar 30, 2023
@alamb alamb merged commit c9bf3f3 into apache:main Mar 30, 2023
@alamb
Contributor

alamb commented Apr 4, 2023

> Thanks @alamb -- I will create a separate follow-on that leverages type aliases to simplify things after this and its sister PR merges.

I have made a PR #5863 that doesn't change the type signatures but I think will make using this new structure easier.

@mustafasrepo mustafasrepo deleted the feature/exec_plan_req_ordering branch April 11, 2023 06:59