Minor: clean up data page statistics tests and fix bugs #11236
@@ -766,7 +766,7 @@ macro_rules! get_data_page_statistics {
[<$stat_type_prefix Int32DataPageStatsIterator>]::new($iterator)
.map(|x| {
x.into_iter().filter_map(|x| {
Changing these made the tests pass, but it seems potentially incorrect to me? It is the same thing as is being done in the row summary tests. See
datafusion/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Lines 331 to 336 in 3421b52
DataType::UInt32 => Ok(Arc::new(UInt32Array::from_iter(
[<$stat_type_prefix Int32StatsIterator>]::new($iterator).map(|x| x.map(|x| *x as u32)),
))),
DataType::UInt64 => Ok(Arc::new(UInt64Array::from_iter(
[<$stat_type_prefix Int64StatsIterator>]::new($iterator).map(|x| x.map(|x| *x as u64)),
))),
Ok - I think I've figured it out. This is actually fixing a bug because the Int32 or Int64 should be coerced into an unsigned value.
I think the Parquet physical types are always signed (because of its heavy Java influence), so when storing unsigned values in Arrow they need to be cast using as.
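To make the point above concrete, here is a minimal sketch of why an `as` cast recovers the unsigned value. It assumes the unsigned value was written by keeping its bits unchanged in the signed INT32 or INT64 physical type. The helper names `decode_u32_stat` and `decode_u64_stat` are hypothetical and are not part of DataFusion or the parquet crate.

```rust
/// Hypothetical helper: reinterpret a signed 32-bit statistic as unsigned.
/// `as` between same-width integers keeps the bits and changes only the type.
fn decode_u32_stat(stored: i32) -> u32 {
    stored as u32
}

/// Hypothetical helper: the same idea for 64-bit statistics.
fn decode_u64_stat(stored: i64) -> u64 {
    stored as u64
}

/// Assumption for illustration: a writer stores a u32 by keeping its bits
/// in an i32, so large unsigned values show up as negative numbers.
fn encode_u32_as_physical(value: u32) -> i32 {
    value as i32
}
```

Under that assumption, `encode_u32_as_physical(u32::MAX)` yields `-1`, and `decode_u32_stat(-1)` returns `u32::MAX` again, so the round trip is lossless. Small values such as `7` are unchanged in both directions.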
Love it -- thank you so much @efredine
* Change data page statistics to Check::Both for most remaining tests. Binary data still incomplete. Struct not implemented. Two failing tests that need further investigation.
* Enables Check::Both for test_numeric_limits_unsigned and fixes broken tests, though uncertain why the tests were failing before the change.
---------
Co-authored-by: Eric Fredine <[email protected]>
Which issue does this PR close?
Closes #11235.
Rationale for this change
There were several tests where we missed changing Check::RowGroup to Check::Both.
What changes are included in this PR?
Changes most remaining tests in https://github.com/apache/datafusion/blob/3421b52605b00cd2e5a6498ea210cce196a19496/datafusion/core/tests/parquet/arrow_statistics.rs to Check::Both. The only two remaining tests are the one still in flight in #11200 and the one for nested Struct, which hasn't been implemented yet.
However, note that there were two failing tests:
datafusion/datafusion/core/tests/parquet/arrow_statistics.rs
Line 1474 in 3421b52
and
datafusion/datafusion/core/tests/parquet/arrow_statistics.rs
Line 1485 in 3421b52
I fixed these tests by changing from u32::try_from(x).ok() to Some(x as u32), and similarly for u64. I did this because I noticed it's what the existing Row Group statistics tests do as well. I think this is probably the right thing to do, given that we want to cast the Int32 or Int64 value in the Parquet file into an unsigned int.
Are these changes tested?
Yes
Are there any user-facing changes?
No
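
As an illustration of the fix described above, the following sketch contrasts the two conversions on a list of optional Int32 statistics. It is not the code from the `get_data_page_statistics` macro. The function names and the input values are hypothetical, and the input assumes that large unsigned values were stored as negative signed integers.

```rust
/// Hypothetical stand-in for the old behaviour: a checked conversion that
/// turns every negative stored value into None, losing the statistic.
fn stats_with_try_from(stats: &[Option<i32>]) -> Vec<Option<u32>> {
    stats
        .iter()
        .map(|s| s.and_then(|x| u32::try_from(x).ok()))
        .collect()
}

/// Hypothetical stand-in for the new behaviour: reinterpret the stored bits
/// as unsigned, so every present statistic is kept.
fn stats_with_as(stats: &[Option<i32>]) -> Vec<Option<u32>> {
    stats.iter().map(|s| s.map(|x| x as u32)).collect()
}
```

For the input `[Some(0), Some(-1), None]`, the checked version produces `[Some(0), None, None]`, while the `as` version produces `[Some(0), Some(u32::MAX), None]`. Only the second keeps the maximum of an unsigned column, which may be why a test such as test_numeric_limits_unsigned failed before the change.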