-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable reading StringViewArray
by default from Parquet
#12092
base: main
Are you sure you want to change the base?
Conversation
StringViewArray
by default from Parquet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrow-rs skips now the interval parts with 0? so interval tests are broken now
That is due to the arrow upgrade for sure -- you can see the changes needed here (in their own PR): #12032 |
bc5d7f7
to
27f7d1e
Compare
I think the last remaining failure is due to #12123 |
27f7d1e
to
ce5470d
Compare
@@ -264,8 +264,12 @@ impl PruningStatistics for BloomFilterStatistics { | |||
.iter() | |||
.map(|value| { | |||
match value { | |||
ScalarValue::Utf8(Some(v)) => sbbf.check(&v.as_str()), | |||
ScalarValue::Binary(Some(v)) => sbbf.check(v), | |||
ScalarValue::Utf8(Some(v)) | ScalarValue::Utf8View(Some(v)) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs its own ticket / test I think. Will do it shortly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #12499
Still working on benchmarks, but the code is looking good |
2275216
to
e8f7384
Compare
This comment was marked as outdated.
This comment was marked as outdated.
I ran benchmarks and posted results, -- clickbench looks great 🚀 . However, interesting |
55e0dc2
to
6500361
Compare
Benchmarks results clickbench_1:Is slower due to #12771
|
clickbench_extendedLooking good here
|
clickbench_partitionedSlowing down also due to #12509 / #12788
|
Current status: #11682 (comment) |
6500361
to
5b67afe
Compare
My plan is to pull the changes from the following PRs into this PR and rerun the overall perf test
Hopefully that will show the performance we need. Then we just need to get the various PRs actually merged and we can do this 😅 |
5b67afe
to
62a738e
Compare
62a738e
to
dda0a9a
Compare
dda0a9a
to
4a57773
Compare
Draft as it builds on:
53.1.0
/ fix clippy #12724binary_as_string
parquet option #12816 @goldmedalGroupColumn
support for byte view #12809 from @Rachelinthits_partitioned
Which issue does this PR close?
Closes #11682
Rationale for this change
Reading data as
StringViewArray
is significantly faster thanStringArray
. We have been testing this behind a feature flag but it is now stable enough to enable by default.See blog post #11603:
Benchmark Results
What changes are included in this PR?
schema_force_view_types
to trueAre these changes tested?
Yes, by CI tests
Are there any user-facing changes?
If you see an error related to StringView use, you can disable this feature using the schema_force_string_view option
Context
@XiangpengHao debugged these tests previously using #11862