Fix data page statistics when all rows are null in a data page #11295

@efredine efredine commented Jul 5, 2024

Which issue does this PR close?

Closes #11280.

Rationale for this change

When all rows in a data page are null, the min and max statistics should be present but null. Some of the data page statistics iterators were incorrectly omitting the statistics rather than setting them to null, which produced an array whose length differed from the number of data pages.
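
To illustrate the length mismatch described above, here is a minimal std-only sketch. The `PageMin` alias and both function names are hypothetical and are not taken from the DataFusion code; the real iterators work with the Parquet page index and produce Arrow arrays.

```rust
/// Hypothetical per-page minimum: `None` means every row on the page is null.
type PageMin = Option<i64>;

/// Buggy shape: `flatten` silently drops pages that have no minimum,
/// so the result can be shorter than the number of data pages.
fn mins_dropping_all_null_pages(pages: &[PageMin]) -> Vec<i64> {
    pages.iter().copied().flatten().collect()
}

/// Intended shape: one output slot per page, with all-null pages kept as `None`.
fn mins_preserving_all_null_pages(pages: &[PageMin]) -> Vec<Option<i64>> {
    pages.iter().copied().collect()
}
```

With three pages where the middle one is entirely null, the first function returns two values while the second returns three, which is the alignment the per-page statistics need.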

What changes are included in this PR?

Adds tests for data page statistics for all data types when all rows in a data page are null. Fixes the data page statistics iterators that fail these tests.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

…ull. Fixes most of the failing tests for iterators not handling this situation correctly.
@github-actions github-actions bot added the core Core DataFusion crate label Jul 5, 2024
@efredine efredine marked this pull request as draft July 5, 2024 20:58
@@ -600,6 +601,31 @@ make_data_page_stats_iterator!(
Index::DOUBLE,
f64
);
Contributor Author

Just consolidating these together.

@alamb alamb left a comment

Thank you @efredine -- this looks (really) nice.

Also thank you @Rachelint for the review


// There is one data page with 4 nulls
// The statistics should be present but null
Test {
I verified that this test covered the code changes by running the test without the code changes and it failed as expected.

thread 'parquet::arrow_statistics::test_data_page_stats_with_all_null_page' panicked at datafusion/core/tests/parquet/arrow_statistics.rs:276:13:
assertion `left == right` failed: col: Mismatch with expected data page minimums
  left: PrimitiveArray<UInt64>
[
]
 right: PrimitiveArray<UInt64>
[
  null,
]
stack backtrace:
   0: rust_begin_unwind
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:364:5
   4: parquet_exec::parquet::arrow_statistics::Test::run_checks
             at ./tests/parquet/arrow_statistics.rs:276:13
   5: parquet_exec::parquet::arrow_statistics::Test::run
             at ./tests/parquet/arrow_statistics.rs:229:9
   6: parquet_exec::parquet::arrow_statistics::test_data_page_stats_with_all_null_page::{{closure}}
             at ./tests/parquet/arrow_statistics.rs:567:9
   7: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/future/future.rs:123:9
   8: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/future/future.rs:123:9
   9: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}
             at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:659:57
  10: tokio::runtime::coop::with_budget
             at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:107:5
  11: tokio::runtime::coop::budget
             at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/coop.rs:73:5
  12: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}
             at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:659:25
  13: tokio::runtime::scheduler::current_thread::Context::enter
             at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:404:19
  14: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}
             at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:658:36
  15: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
             at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:737:68
  16: tokio::runtime::context::scoped::Scoped<T>::set
             at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/scoped.rs:40:9
  17: tokio::runtime::context::set_scheduler::{{closure}}
             at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context.rs:180:26
  18: std::thread::local::LocalKey<T>::try_with
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/thread/local.rs:286:12
  19: std::thread::local::LocalKey<T>::with
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/thread/local.rs:262:9
  20: tokio::runtime::context::set_scheduler
             at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context.rs:180:9
  21: tokio::runtime::scheduler::current_thread::CoreGuard::enter
             at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:737:27
  22: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
             at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:646:19
  23: tokio::runtime::scheduler::current_thread::CurrentThread::block_on::{{closure}}
             at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:175:28
  24: tokio::runtime::context::runtime::enter_runtime
             at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/context/runtime.rs:65:16
  25: tokio::runtime::scheduler::current_thread::CurrentThread::block_on
             at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/scheduler/current_thread/mod.rs:167:9
  26: tokio::runtime::runtime::Runtime::block_on
             at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.38.0/src/runtime/runtime.rs:347:47
  27: parquet_exec::parquet::arrow_statistics::test_data_page_stats_with_all_null_page
             at ./tests/parquet/arrow_statistics.rs:517:5
  28: parquet_exec::parquet::arrow_statistics::test_data_page_stats_with_all_null_page::{{closure}}
             at ./tests/parquet/arrow_statistics.rs:516:51
  29: core::ops::function::FnOnce::call_once
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:250:5
  30: core::ops::function::FnOnce::call_once
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

FixedSizeBinaryArray::new(*size, vec![].into(), None)
})
))
let mut builder = FixedSizeBinaryBuilder::new(*size);
this looks like a nice change to me

alamb commented Jul 7, 2024

Let's merge this one in so we can proceed with getting #11319 ready

alamb commented Jul 7, 2024

Thanks again!

@alamb alamb merged commit 6f330c9 into apache:main Jul 7, 2024
23 checks passed
@efredine efredine deleted the fix-handling-of-parquet-data-page-stats-when-all-null branch July 7, 2024 23:19
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…e#11295)

* Adds tests for data page statistics when all values on the page are null. Fixes most of the failing tests for iterators not handling this situation correctly.

* Fix handling of data page statistics for FixedBinaryArray using a builder.

* Fix data page all nulls stats test for Dictionary DataType.

* Fixes handling of None statistics for Decimal128 and Decimal256.

* Consolidate make_data_page_stats_iterator uses.

* Fix linting error.

* Remove unnecessary collect.

---------

Co-authored-by: Eric Fredine <[email protected]>
Successfully merging this pull request may close these issues.

Incorrect statistics extracted from parquet data pages when all values are null
3 participants