-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prune columns are all null in ParquetExec by row_counts , handle IS NOT NULL #9989
Conversation
@@ -1026,15 +1028,17 @@ mod tests { | |||
column_statistics: Vec<ParquetStatistics>, | |||
) -> RowGroupMetaData { | |||
let mut columns = vec![]; | |||
let number_row = 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before all unit test set each col with default 0 row
, will trigger num_rows == num_nulls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if this could cause problems in real files (for example, if the row counts were not included in the statistics that were written into the file).
However, I double checked the code and it seems like ColumnChunkMetaData::num_values
is non nullable so I think we are good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me 👍 Thanks @Ted-Jiang
None | ||
fn row_counts(&self, column: &Column) -> Option<ArrayRef> { | ||
let (c, _) = self.column(&column.name)?; | ||
let scalar = ScalarValue::UInt64(Some(c.num_values() as u64)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another implementor returns an Int64
array. I'm not sure if the type matters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to keep with https://github.com/apache/arrow-datafusion/blob/18c9f4d7e68b3eb470b2b2ef3f2297e018dd4298/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L340-L342 avoid cast during =
, i think Int64Array only use in test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UInt64
also matches the documentation https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html#tymethod.row_counts 👍
I made a PR to fix the tests: #10004
Co-authored-by: Ruihang Xia <[email protected]>
Thanks @Ted-Jiang -- this looks awesome -- I look forward to reviewing it later today |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Ted-Jiang -- and @waynexia . I think the code looks good, though I think a few more tests would improve the coverage we could add them as a follow on PR.
Do you plan to add support in page_filter.rs
as well (maybe that is why the PR is marked "Part #9961 ")?
@@ -1026,15 +1028,17 @@ mod tests { | |||
column_statistics: Vec<ParquetStatistics>, | |||
) -> RowGroupMetaData { | |||
let mut columns = vec![]; | |||
let number_row = 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if this could cause problems in real files (for example, if the row counts were not included in the statistics that were written into the file).
However, I double checked the code and it seems like ColumnChunkMetaData::num_values
is non nullable so I think we are good.
// After pruning, only row group 1,3 should be selected | ||
RowGroupPruningTest::new() | ||
.with_scenario(Scenario::AllNullValues) | ||
.with_query("SELECT * FROM t WHERE \"i8\" is Null") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add a tests:
i16 IS NOT NULL
(to cover the opposite)i32 > 7
(prune via nulls and some via counts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None | ||
fn row_counts(&self, column: &Column) -> Option<ArrayRef> { | ||
let (c, _) = self.column(&column.name)?; | ||
let scalar = ScalarValue::UInt64(Some(c.num_values() as u64)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UInt64
also matches the documentation https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html#tymethod.row_counts 👍
I made a PR to fix the tests: #10004
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Ted-Jiang -- the support for IS NOT NULL
is also quite neat.
@alamb sorry there is a bug here i will fix this it should
|
…atistics
Which issue does this PR close?
Part #9961 .
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?