
Support parquet page filtering on min_max for decimal128 and string columns #4255

Merged
merged 8 commits into from
Nov 22, 2022

Conversation


@Ted-Jiang Ted-Jiang commented Nov 17, 2022

Which issue does this PR close?

Related #3833.

Rationale for this change

Support page index filter on min_max for type decimal and string.

The string code is from @alamb Thanks!

Only cast to Decimal128Array, to align with row group pruning.

As for null_count, I prefer to fix it in the next PR.
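Conceptually, page-level min/max pruning keeps only the pages whose statistics could contain matching rows. A minimal sketch of the idea (hypothetical types, not DataFusion's API) for an equality predicate on a decimal column stored as i128:

```rust
// Hypothetical per-page statistics; in practice DataFusion reads these
// from the Parquet page index (column index) for each data page.
struct PageMinMax {
    min: i128,
    max: i128,
}

// Keep the indexes of pages whose [min, max] range could contain `value`;
// all other pages can be skipped without decoding.
fn prune_pages_eq(pages: &[PageMinMax], value: i128) -> Vec<usize> {
    pages
        .iter()
        .enumerate()
        .filter(|(_, p)| p.min <= value && value <= p.max)
        .map(|(i, _)| i)
        .collect()
}
```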

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Nov 17, 2022
// Convert a byte array to i128.
// The input bytes must be big-endian.
// Copied from arrow-rs.
pub(crate) fn from_bytes_to_i128(b: &[u8]) -> i128 {
Member Author:
Move the common func here.
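For reference, a minimal sketch of such a big-endian bytes-to-i128 conversion (modeled on the helper discussed here; not the exact arrow-rs implementation) could look like:

```rust
// Sketch only: convert a big-endian byte slice (at most 16 bytes) to an
// i128, sign-extending from the most significant byte.
pub fn from_bytes_to_i128(b: &[u8]) -> i128 {
    assert!(b.len() <= 16, "decimal value must fit in 16 bytes");
    // Negative values (high bit of the first byte set) are padded with
    // 0xFF, non-negative values with 0x00.
    let fill = if b.first().map_or(false, |&v| v & 0x80 != 0) { 0xFF } else { 0x00 };
    let mut bytes = [fill; 16];
    bytes[16 - b.len()..].copy_from_slice(b);
    i128::from_be_bytes(bytes)
}
```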

Contributor @alamb left a comment:

Thank you @Ted-Jiang -- this looks great

I wonder if it would be possible to add some more targeted testing for the string and decimal page indexes in https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/parquet/page_pruning.rs

The current test in parquet_exec I think ensures that the plumbing is all hooked up correctly, but I think some more targeted testing would be good too

However, overall I think this PR could also go in as is. Thanks a lot!

@@ -266,20 +266,17 @@ async fn single_file_small_data_pages() {
// page 3: DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: djzdyiecnumrsrcbizwlqzdhnpoiqdh, max: fktdcgtmzvoedpwhfevcvvrtaurzgex, num_nulls not defined] CRC:[none] SZ:7 VC:9216
// page 4: DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: fktdcgtmzvoedpwhfevcvvrtaurzgex, max: fwtdpgtxwqkkgtgvthhwycrvjiizdifyp, num_nulls not defined] CRC:[none] SZ:7 VC:9216
// page 5: DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: fwtdpgtxwqkkgtgvthhwycrvjiizdifyp, max: iadnalqpdzthpifrvewossmpqibgtsuin, num_nulls not defined] CRC:[none] SZ:7 VC:7739
//
Contributor:

🎉

@@ -146,6 +148,7 @@ impl BatchBuilder {
.append_option(rng.gen_bool(0.9).then(|| rng.gen()));
self.response_status
.append_value(status[rng.gen_range(0..status.len())]);
self.prices_status.append_value(self.row_count as i128);
Contributor:

The incrementing price makes sense for range testing.

@@ -382,6 +394,9 @@ fn create_row_count_in_each_page(
struct PagesPruningStatistics<'a> {
col_page_indexes: &'a Index,
col_offset_indexes: &'a Vec<PageLocation>,
// target_type means the logical type in schema: like 'DECIMAL' is the logical type, but the
Contributor:

👍

@@ -419,10 +468,37 @@ macro_rules! get_min_max_values_for_page_index {
vec.iter().map(|x| x.$func().cloned()),
)))
}
Index::INT96(_) | Index::BYTE_ARRAY(_) | Index::FIXED_LEN_BYTE_ARRAY(_) => {
Index::BYTE_ARRAY(index) => {
let vec = &index.indexes;
Contributor:

Decimal should be supported for this logical type.

Contributor:

Arrow-rs contains a method for decoding decimals from byte arrays in ByteArrayReader.

Member Author:

Thanks, I prefer to align with the row group handling and do them together in another PR.

Contributor:

Adding additional support in a follow-on PR sounds like a good idea to me -- maybe we can file a ticket to track the work.

@liukun4515 (Contributor):

> Thank you @Ted-Jiang -- this looks great
>
> I wonder if it would be possible to add some more targeted testing for the string and decimal page indexes in https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/parquet/page_pruning.rs
>
> The current test in parquet_exec I think ensures that the plumbing is all hooked up correctly, but I think some more targeted testing would be good too
>
> However, overall I think this PR could also go in as is. Thanks a lot!

Agree.

@Ted-Jiang We can add more tests for this, using different physical data types with different decimal data types (different precision and scale combinations would be better).

Signed-off-by: yangjiang <[email protected]>
@Ted-Jiang (Member Author):

Will add more tests.

Signed-off-by: yangjiang <[email protected]>
Signed-off-by: yangjiang <[email protected]>
@Ted-Jiang (Member Author) commented Nov 18, 2022

@alamb @liukun4515 Added the same type checks as in row_group for page-index pruning.

Reorganized the test code: refactored to avoid duplicate code in tests.
Added tests for the page index: the same test cases as for row groups.

Some tests are ignored; I think there is a bug with complex_expr 🤔 will fix in the next PR (no clue yet).

@@ -503,465 +483,3 @@ async fn prune_decimal_in_list() {
)
.await;
}

Member Author:

Code moved to mod.rs.

}

#[tokio::test]
#[ignore]
Member Author:

All test cases with exprs fail 😭

Contributor @alamb, Nov 21, 2022:

I wonder if we have to run "type coercion / simplification" on them first?

Did row group pruning run this "type coercion / simplification" 🤔? I think they are the same code path; I will find out soon.

Contributor:

@Ted-Jiang Does the test_prune function not run the optimizer?

Member Author @Ted-Jiang, Nov 22, 2022:

Filed related #4317.

@alamb (Contributor) commented Nov 20, 2022

I plan to review this carefully tomorrow again -- sorry for the delay @Ted-Jiang

Contributor @alamb left a comment:

I think this is looking good @Ted-Jiang -- the only thing I think should be addressed before merging is the use of min() rather than $func() -- and that may just be my misunderstanding.

Once that is sorted out, perhaps we then get this PR in and then iterate on additional changes / data type support as follow ons?

let vec = &index.indexes;
let vec: Vec<Option<i128>> = vec
.iter()
.map(|x| x.min().and_then(|x| Some(*x as i128)))
Contributor:

I wonder if this should be x.$func() rather than x.min()?

Member Author:

You are right! I forgot to change it when writing the macro; real surprise the unit tests did not cover this 😂

Member Author:

I will add a greater-than test case.
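To illustrate the bug under discussion, here is a minimal (hypothetical, non-DataFusion) sketch of a macro parameterized by an accessor name, showing how hard-coding `min()` where `$func` should be used silently breaks the `max` variant:

```rust
// Hypothetical page-statistics type, for illustration only.
struct PageStats {
    lo: i128,
    hi: i128,
}

impl PageStats {
    fn min(&self) -> i128 {
        self.lo
    }
    fn max(&self) -> i128 {
        self.hi
    }
}

// The macro takes the accessor as `$func`; writing `x.min()` instead of
// `x.$func()` in the body would make the `max` call site return minima,
// which comparison-based tests against only the min page may not catch.
macro_rules! collect_stat {
    ($pages:expr, $func:ident) => {
        $pages.iter().map(|x| x.$func()).collect::<Vec<i128>>()
    };
}
```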

let vec = &index.indexes;
let vec: Vec<Option<i128>> = vec
.iter()
.map(|x| x.min().and_then(|x| Some(*x as i128)))
Contributor:

Same question here -- should this be x.$func() rather than x.min()?

Comment on lines 416 to 422
if let Ok(arr) = Decimal128Array::from(vec)
.with_precision_and_scale(*precision, *scale)
{
return Some(Arc::new(arr));
} else {
return None;
}
Contributor:

You might be able to do this more functionally with something like (untested):

Suggested change
if let Ok(arr) = Decimal128Array::from(vec)
.with_precision_and_scale(*precision, *scale)
{
return Some(Arc::new(arr));
} else {
return None;
}
Decimal128Array::from(vec)
.with_precision_and_scale(*precision, *scale)
.ok()
.map(|arr| Arc::new(arr))

Member Author:

Nice suggestion! So many APIs where one needs to remember to deal with Option and Result 😂
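The idiom in the suggestion generalizes: a fallible constructor returning Result can be collapsed into an Option with `.ok()` and then wrapped with `.map`, avoiding the if/else branches. A generic sketch, with a hypothetical `with_scale` standing in for `with_precision_and_scale` and `Arc::new` mirroring the wrapping step:

```rust
use std::sync::Arc;

// Hypothetical fallible constructor: scale a value by 10^scale,
// failing on overflow (stand-in for with_precision_and_scale).
fn with_scale(v: i128, scale: u32) -> Result<i128, String> {
    10i128
        .checked_pow(scale)
        .and_then(|m| v.checked_mul(m))
        .ok_or_else(|| "overflow".to_string())
}

// `.ok()` collapses the Result into an Option; `.map(Arc::new)` wraps
// the success value, with no explicit branching needed.
fn build(v: i128, scale: u32) -> Option<Arc<i128>> {
    with_scale(v, scale).ok().map(Arc::new)
}
```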


@@ -204,3 +222,466 @@ async fn page_index_filter_multi_col() {
let batch = results.next().await.unwrap().unwrap();
assert_eq!(batch.num_rows(), 7300);
}

async fn test_prune(
Contributor:

This is great coverage -- thanks @Ted-Jiang. It is somewhat repetitive with the row group pruning but I think that is ok as they are different code paths

}

#[tokio::test]
async fn prune_decimal_eq() {
Contributor:

It might be worth another test that prunes something other than 5 rows -- maybe where decimal_col = 30.00, pruning out the other pages? All of the tests here seem to prune out only the third page (20.00 -> 60.00).

@xudong963 (Member):

xudong963 commented Nov 22, 2022

I'll take a look later :)

Signed-off-by: yangjiang <[email protected]>
Signed-off-by: yangjiang <[email protected]>
@alamb (Contributor) commented Nov 22, 2022

I think this is ready to go in now -- thank you @xudong963 and @Ted-Jiang and @liukun4515

@alamb alamb merged commit d7a7fb6 into apache:master Nov 22, 2022
@ursabot commented Nov 22, 2022

Benchmark runs are scheduled for baseline = eac254c and contender = d7a7fb6. d7a7fb6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
