Prune columns that are all null in ParquetExec by row_counts; handle IS NOT NULL #9989

Merged · 7 commits · Apr 10, 2024
@@ -342,8 +342,10 @@ impl<'a> PruningStatistics for RowGroupPruningStatistics<'a> {
scalar.to_array().ok()
}

fn row_counts(&self, _column: &Column) -> Option<ArrayRef> {
None
fn row_counts(&self, column: &Column) -> Option<ArrayRef> {
let (c, _) = self.column(&column.name)?;
let scalar = ScalarValue::UInt64(Some(c.num_values() as u64));
scalar.to_array().ok()
}

fn contained(
@@ -1026,15 +1028,17 @@ mod tests {
column_statistics: Vec<ParquetStatistics>,
) -> RowGroupMetaData {
let mut columns = vec![];
let number_row = 1000;
Ted-Jiang (Member, Author): Before this change, the unit tests built each column with a default of 0 rows, which made num_rows == num_nulls and would spuriously trigger the all-null pruning path.
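To see why the old test default mattered, here is a minimal sketch of the all-null check that implementing `row_counts` enables. The helper below is illustrative only (DataFusion's `PruningPredicate` performs this comparison over Arrow arrays, not scalars): a row group's column is provably all null exactly when its null count equals its row count, so a default of 0 rows with 0 nulls matches vacuously.

```rust
// Simplified stand-in for the check enabled by implementing `row_counts`:
// a row group column is provably all-null when its null count (from
// Parquet statistics) equals its row count (from column chunk metadata).
fn is_provably_all_null(null_count: u64, row_count: u64) -> bool {
    null_count == row_count
}

fn main() {
    // The old test default of 0 rows made every column look all-null.
    assert!(is_provably_all_null(0, 0));
    // With 1000 rows, the check behaves as intended.
    assert!(!is_provably_all_null(0, 1000));
    // A genuinely all-null row group.
    assert!(is_provably_all_null(1000, 1000));
}
```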

Contributor:
I was wondering if this could cause problems in real files (for example, if the row counts were not included in the statistics written to the file).

However, I double-checked the code, and it seems ColumnChunkMetaData::num_values is non-nullable, so I think we are good.

for (i, s) in column_statistics.iter().enumerate() {
let column = ColumnChunkMetaData::builder(schema_descr.column(i))
.set_statistics(s.clone())
.set_num_values(number_row)
.build()
.unwrap();
columns.push(column);
}
RowGroupMetaData::builder(schema_descr.clone())
.set_num_rows(1000)
.set_num_rows(number_row)
.set_total_byte_size(2000)
.set_column_metadata(columns)
.build()
30 changes: 30 additions & 0 deletions datafusion/core/tests/parquet/mod.rs
@@ -28,6 +28,7 @@ use arrow::{
record_batch::RecordBatch,
util::pretty::pretty_format_batches,
};
use arrow_array::new_null_array;
use chrono::{Datelike, Duration, TimeDelta};
use datafusion::{
datasource::{physical_plan::ParquetExec, provider_as_source, TableProvider},
@@ -75,6 +76,7 @@ enum Scenario {
DecimalLargePrecisionBloomFilter,
ByteArray,
PeriodsInColumnNames,
AllNullValues,
}

enum Unit {
@@ -630,6 +632,27 @@ fn make_names_batch(name: &str, service_name_values: Vec<&str>) -> RecordBatch {
RecordBatch::try_new(schema, vec![Arc::new(name), Arc::new(service_name)]).unwrap()
}

/// Return a record batch with i8, i16, i32, and i64 columns containing all null values
fn make_all_null_values() -> RecordBatch {
let schema = Arc::new(Schema::new(vec![
Field::new("i8", DataType::Int8, true),
Field::new("i16", DataType::Int16, true),
Field::new("i32", DataType::Int32, true),
Field::new("i64", DataType::Int64, true),
]));

RecordBatch::try_new(
schema,
vec![
new_null_array(&DataType::Int8, 5),
new_null_array(&DataType::Int16, 5),
new_null_array(&DataType::Int32, 5),
new_null_array(&DataType::Int64, 5),
],
)
.unwrap()
}

fn create_data_batch(scenario: Scenario) -> Vec<RecordBatch> {
match scenario {
Scenario::Timestamps => {
@@ -799,6 +822,13 @@ fn create_data_batch(scenario: Scenario) -> Vec<RecordBatch> {
),
]
}
Scenario::AllNullValues => {
vec![
make_all_null_values(),
make_int_batches(1, 6),
make_all_null_values(),
]
}
}
}

34 changes: 34 additions & 0 deletions datafusion/core/tests/parquet/row_group_pruning.rs
@@ -1264,3 +1264,37 @@ async fn prune_periods_in_column_names() {
.test_row_group_prune()
.await;
}

#[tokio::test]
async fn test_row_group_all_null_values() {
// Three row groups:
// 1. all Null values
// 2. values from 1 to 5
// 3. all Null values

// After pruning, only row group 2 should be selected
RowGroupPruningTest::new()
.with_scenario(Scenario::AllNullValues)
.with_query("SELECT * FROM t WHERE \"i8\" <= 5")
.with_expected_errors(Some(0))
.with_matched_by_stats(Some(1))
.with_pruned_by_stats(Some(2))
.with_expected_rows(5)
.with_matched_by_bloom_filter(Some(0))
.with_pruned_by_bloom_filter(Some(0))
.test_row_group_prune()
.await;

// After pruning, only row group 1,3 should be selected
RowGroupPruningTest::new()
.with_scenario(Scenario::AllNullValues)
.with_query("SELECT * FROM t WHERE \"i8\" is Null")
alamb (Contributor):
Could you also add tests for:

  1. i16 IS NOT NULL (to cover the opposite)
  2. i32 > 7 (prune via nulls and some via counts)

Ted-Jiang (Member, Author), Apr 9, 2024:
@alamb thanks! Added tests in 11567d9, and added support for IS NOT NULL.

alamb: Do you plan to add support in page_filter.rs as well (maybe that is why the PR is marked "Part #9961")?

Ted-Jiang: For page-level pruning, I'd prefer to do it in the next PR, to keep this one short and clean.

.with_expected_errors(Some(0))
.with_matched_by_stats(Some(2))
.with_pruned_by_stats(Some(1))
.with_expected_rows(10)
.with_matched_by_bloom_filter(Some(0))
.with_pruned_by_bloom_filter(Some(0))
.test_row_group_prune()
.await;
}
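The pruning decisions these tests exercise can be summarized with a small sketch. The helpers below are hypothetical, not DataFusion's actual API (the real logic is the `PruningPredicate` rewrite over statistics arrays): for `col IS NULL` a row group is prunable when its null count is 0, and for `col IS NOT NULL` it is prunable when its null count equals its row count.

```rust
// Illustrative-only helpers for the row-group pruning decisions tested
// above. A row group "may match" when pruning cannot rule it out.

// For `col IS NULL`: prunable when the column has no nulls at all.
fn may_contain_null(null_count: u64) -> bool {
    null_count > 0
}

// For `col IS NOT NULL`: prunable when every value is null.
fn may_contain_non_null(null_count: u64, row_count: u64) -> bool {
    null_count < row_count
}

fn main() {
    // In the AllNullValues scenario, row groups 1 and 3 are all null
    // (5 rows, 5 nulls) and row group 2 holds 5 non-null values.
    // `i8 IS NULL` keeps groups 1 and 3 and prunes group 2:
    assert!(may_contain_null(5));
    assert!(!may_contain_null(0));
    // `i8 IS NOT NULL` keeps only group 2:
    assert!(!may_contain_non_null(5, 5));
    assert!(may_contain_non_null(0, 5));
}
```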