Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: clean up data page statistics tests and fix bugs #11236

Merged
merged 2 commits into from
Jul 3, 2024

Conversation

efredine
Copy link
Contributor

@efredine efredine commented Jul 3, 2024

Which issue does this PR close?

Closes #11235.

Rationale for this change

There were several tests where we missed changing Check::RowGroup to Check::Both.

What changes are included in this PR?

Changes most remaining tests in https://github.com/apache/datafusion/blob/3421b52605b00cd2e5a6498ea210cce196a19496/datafusion/core/tests/parquet/arrow_statistics.rs to Check::Both. The only two remaining tests are the one still in flight in #11200 and the one for nested Struct which hasn't been implemented yet.

However, note that there were two failing tests:

expected_max: Arc::new(UInt32Array::from(vec![100, u32::MAX])),

and

expected_max: Arc::new(UInt64Array::from(vec![100, u64::MAX])),

I fixed these tests from changing from u32::try_from(x).ok to Some(x as u32) and similar for u64. I did this because I noticed its what the existing Row Group statistics tests are doing as well. But I think this is probably the right thing to do given that we want to cast from the Int32 or Int64 value in the Parquet file into a unsigned int.

Are these changes tested?

Yes

Are there any user-facing changes?

No

Eric Fredine added 2 commits July 2, 2024 17:27
…Binary data still incomplete. Struct not implemented. Two failing tests that need further investigation.
… tests, though uncertain why the tests were failing before the change.
@github-actions github-actions bot added the core Core DataFusion crate label Jul 3, 2024
@@ -766,7 +766,7 @@ macro_rules! get_data_page_statistics {
[<$stat_type_prefix Int32DataPageStatsIterator>]::new($iterator)
.map(|x| {
x.into_iter().filter_map(|x| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing these made the tests pass, but it seems potentially incorrect to me? It is the same thing as is being done in the row summary tests. See

DataType::UInt32 => Ok(Arc::new(UInt32Array::from_iter(
[<$stat_type_prefix Int32StatsIterator>]::new($iterator).map(|x| x.map(|x| *x as u32)),
))),
DataType::UInt64 => Ok(Arc::new(UInt64Array::from_iter(
[<$stat_type_prefix Int64StatsIterator>]::new($iterator).map(|x| x.map(|x| *x as u64)),
))),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - I think I've figured it out. This is actually fixing a bug because the Int32 or Int64 should be coerced into an unsigned value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the parquet physical types are always signed (because of its heavy java influence) so when storing unsigned values in arrow they need to be cast using as

@efredine efredine changed the title Minor: clean up data page statistics tests Minor: clean up data page statistics tests and fix bugs Jul 3, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it -- thank you so much @efredine

@@ -766,7 +766,7 @@ macro_rules! get_data_page_statistics {
[<$stat_type_prefix Int32DataPageStatsIterator>]::new($iterator)
.map(|x| {
x.into_iter().filter_map(|x| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the parquet physical types are always signed (because of its heavy java influence) so when storing unsigned values in arrow they need to be cast using as

@alamb alamb merged commit 0922d4a into apache:main Jul 3, 2024
25 checks passed
comphead pushed a commit to comphead/arrow-datafusion that referenced this pull request Jul 8, 2024
* Change data page statistics to Check::Both for most remaining tests. Binary data still incomplete. Struct not implemented. Two failing tests that need further investigation.

* Enables Check::Both for test_numeric_limits_unsigned and fixes broken tests, though uncertain why the tests were failing before the change.

---------

Co-authored-by: Eric Fredine <[email protected]>
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* Change data page statistics to Check::Both for most remaining tests. Binary data still incomplete. Struct not implemented. Two failing tests that need further investigation.

* Enables Check::Both for test_numeric_limits_unsigned and fixes broken tests, though uncertain why the tests were failing before the change.

---------

Co-authored-by: Eric Fredine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clean Up Data Page Statistics Tests and Fix Bug
2 participants