Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve stats convert performance for Binary/String/Boolean arrays #11319

Merged
merged 15 commits into from
Jul 8, 2024

Conversation

Rachelint
Copy link
Contributor

@Rachelint Rachelint commented Jul 7, 2024

Which issue does this PR close?

Closes #11281

Rationale for this change

Improve the stats convert performance as possible.

What changes are included in this PR?

  • Use string builder directly to eliminate some tmp Vecs and Strings(for example, Utf8 and LargeUtf case in both row group or page stats).
  • Eliminate some unnecessary collect to avoid tmp Vec creation

Are these changes tested?

Test by exist tests.

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Jul 7, 2024
@Rachelint Rachelint changed the title Use Builder to improve stats convert performance Improve stats convert performance Jul 7, 2024
@Rachelint Rachelint marked this pull request as ready for review July 7, 2024 15:36
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Rachelint -- this code looks great. 🚀

I do think this will conflict with #11295 from @efredine

I am running the benchmarks and will post my results shortly

continue;
}

builder.append_value(x).expect("ensure to append successfully here, because size have been checked before");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than using expect it is likely possible to check the value of builder.append_value and if it is an error append null. But perhaps that is overly confusing. This formulation is quite clear and explicit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than using expect it is likely possible to check the value of builder.append_value and if it is an error append null. But perhaps that is overly confusing. This formulation is quite clear and explicit

🤔 Yes, I append_null before. And I read the codes of append_value after, I found the only reason to return Err is to append a value with different width that has been checked and ensured actually.

.collect::<Vec<_>>()
.into_iter()
.flatten()
// BooleanArray::from_iter required a sized iterator, so collect into Vec first
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -747,10 +770,10 @@ macro_rules! get_data_page_statistics {
Some(DataType::Boolean) => Ok(Arc::new(
BooleanArray::from_iter(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use BooleanBuilder instead? That would likely be more efficient.

Copy link
Contributor Author

@Rachelint Rachelint Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use BooleanBuilder instead? That would likely be more efficient.

Seems reasonable! Maybe it can work to remove the collect in BooleanArray case, I try it soon.

@alamb
Copy link
Contributor

alamb commented Jul 7, 2024

I suspect the remaining cases where we are using collect could be made more efficient using the Builder pattern?

I think the reason the Builder is faster for Strings / Binary is that due to how the references worked out, we can avoid copying the strings via value.to_string()

I don't think the Builder pattern is fundamentally better/worse than the from_iter (under the covers they all end up doing the same thing in arrow-rs I think)

@alamb alamb changed the title Improve stats convert performance Improve stats convert performance for Binary/String arrays Jul 7, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the benchmarks and this looks like a 2x performance improvement for strings and dictionaries 👍

group                                                                                                      improve-page-stats-convert              main
-----                                                                                                      --------------------------              ----
Extract data page statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, String)    1.00     39.6±3.96µs        ? ?/sec     2.28     90.3±0.20µs        ? ?/sec
Extract data page statistics for F64/extract_statistics/F64                                                1.05     16.7±0.03µs        ? ?/sec     1.00     15.9±0.04µs        ? ?/sec
Extract data page statistics for Int64/extract_statistics/Int64                                            1.00     14.8±0.05µs        ? ?/sec     1.02     15.1±0.47µs        ? ?/sec
Extract data page statistics for String/extract_statistics/String                                          1.00     39.3±4.71µs        ? ?/sec     2.30     90.2±0.29µs        ? ?/sec
Extract data page statistics for UInt64/extract_statistics/UInt64                                          1.01     15.5±0.03µs        ? ?/sec     1.00     15.4±0.04µs        ? ?/sec
Extract row group statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, String)    1.00   1962.4±3.95ns        ? ?/sec     1.06      2.1±0.01µs        ? ?/sec
Extract row group statistics for F64/extract_statistics/F64                                                1.02   1193.2±7.54ns        ? ?/sec     1.00   1174.4±5.82ns        ? ?/sec
Extract row group statistics for Int64/extract_statistics/Int64                                            1.00   1090.8±3.08ns        ? ?/sec     1.01   1099.2±1.98ns        ? ?/sec
Extract row group statistics for String/extract_statistics/String                                          1.00  1918.0±235.67ns        ? ?/sec    1.07      2.1±0.00µs        ? ?/sec
Extract row group statistics for UInt64/extract_statistics/UInt64                                          1.00   1166.0±2.87ns        ? ?/sec     1.01   1176.4±2.77ns        ? ?/sec

@Rachelint
Copy link
Contributor Author

Rachelint commented Jul 8, 2024

I suspect the remaining cases where we are using collect could be made more efficient using the Builder pattern?

I think the reason the Builder is faster for Strings / Binary is that due to how the references worked out, we can avoid copying the strings via value.to_string()

I don't think the Builder pattern is fundamentally better/worse than the from_iter (under the covers they all end up doing the same thing in arrow-rs I think)

Yes, I tested the Uint64Array case in my POC, use Builder directly seems not obviously faster than use from_iter actually.
https://github.com/Rachelint/arrow-datafusion/blob/70b9f05e737e81c514259e70a8bd2e9f0ad8e725/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L790

@Rachelint
Copy link
Contributor Author

Rachelint commented Jul 8, 2024

@alamb Have merged with main, seem ready now.
@efredine @alamb And I made an poc about using Builder instead of collect, performance improved!
see https://github.com/Rachelint/arrow-datafusion/blob/a6f748d0a12559e8c243f2a76ce2c2757bf7c93c/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L748-L784

  • use collect
Extract data page statistics for Boolean/extract_statistics/Boolean
                        time:   [11.392 µs 11.403 µs 11.411 µs]
                        change: [-0.2092% -0.0651% +0.1005%] (p = 0.42 > 0.05)
  • use Builder(the builder v2 arm)
Extract data page statistics for Boolean/extract_statistics/Boolean
                        time:   [10.530 µs 10.542 µs 10.553 µs]
                        change: [-18.333% -18.201% -18.087%] (p = 0.00 < 0.05)

I have added such codes about BooleanArray to this pr.

@Rachelint Rachelint changed the title Improve stats convert performance for Binary/String arrays Improve stats convert performance for Binary/String/Boolean arrays Jul 8, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great -- thanks again @Rachelint and @efredine

@alamb alamb merged commit c254b8b into apache:main Jul 8, 2024
24 checks passed
@Rachelint Rachelint deleted the improve-page-stats-convert branch July 8, 2024 10:15
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…pache#11319)

* add u64 poc.

* use env to support the quick bench.

* use flatten in builder mode yet.

* add new mode.

* use Builder in Utf8 and LargeUtf8 page level stats' convert.

* use Builder in row group level stats.

* eliminate some unnecessary tmp `Vec`s.

* remove the quick modify in bench.

* process the result return from append_value of FixedSizeBinaryBuilder.

* remove the modification of FixedSizeBinary&Bool, and fix the String case.

* fix and re-enable the modification of FixedSizeBinary.

* fix comments.

* use BooleanBuilder to eliminate the collect in BooleanArray case.

* fix compile.
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
…pache#11319)

* add u64 poc.

* use env to support the quick bench.

* use flatten in builder mode yet.

* add new mode.

* use Builder in Utf8 and LargeUtf8 page level stats' convert.

* use Builder in row group level stats.

* eliminate some unnecessary tmp `Vec`s.

* remove the quick modify in bench.

* process the result return from append_value of FixedSizeBinaryBuilder.

* remove the modification of FixedSizeBinary&Bool, and fix the String case.

* fix and re-enable the modification of FixedSizeBinary.

* fix comments.

* use BooleanBuilder to eliminate the collect in BooleanArray case.

* fix compile.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve performance of DataPage statistics extraction using StringBuilder
3 participants