GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 #40094
Conversation
@github-actions crossbow submit -g cpp
The i386 test failure is related.
void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
               int64_t valid_bits_offset) override;
std::shared_ptr<Buffer> FlushValues() override {
  if (byte_width_ == 1) {
So only the single-byte FLBA case would reach this path? I'm ok with the code, but I guess it's rarely exercised.
Same question here, wouldn't it be faster to apply PLAIN encoding for single-byte FLBA type?
`PlainEncoder<FLBAType>` is implemented similarly to this. The only additional step here is `ByteStreamSplitEncode`, which is skipped for the single-byte FLBA type.
cpp/src/parquet/encoding.cc (outdated)
const ColumnDescriptor* descr,
::arrow::MemoryPool* pool = ::arrow::default_memory_pool());
ByteStreamSplitEncoderBase(const ColumnDescriptor* descr, int byte_width,
                           ::arrow::MemoryPool* pool = ::arrow::default_memory_pool())
Since it's a base class, isn't it unnecessary to have a default pool here?
Not necessarily.
cpp/src/parquet/encoding.cc (outdated)
void SetData(int num_values, const uint8_t* data, int len) override {
  if (static_cast<int64_t>(num_values) * byte_width_ != len) {
    throw ParquetException(
        "Data size does not match number of values in BYTE_STREAM_SPLIT");
Can we print the sizes? That could be helpful for debugging.
Will do.
if (!decode_buffer_ || decode_buffer_->size() < size) {
  PARQUET_ASSIGN_OR_THROW(decode_buffer_, ::arrow::AllocateBuffer(size));
  const auto alloc_size = ::arrow::bit_util::NextPower2(size);
Hmm, why force `NextPower2` here?
So that the decode buffer is resized less often when the number of values varies slightly between calls.
Sorry for being late; I was previously on my Spring Festival vacation. This generally LGTM, just some minor comments. Also, do we have benchmark data for this change?
Force-pushed from bb540f3 to 7a19483
@github-actions crossbow submit -g cpp
@ursabot please benchmark
Benchmark runs are scheduled for commit 7a19483. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
if (!decode_buffer_ || decode_buffer_->size() < size) {
  PARQUET_ASSIGN_OR_THROW(decode_buffer_, ::arrow::AllocateBuffer(size));
  const auto alloc_size = ::arrow::bit_util::NextPower2(size);
  PARQUET_ASSIGN_OR_THROW(decode_buffer_, ::arrow::AllocateBuffer(alloc_size));
Just found that this doesn't use a memory_pool. Should we have one here?
This is a temporary buffer, so I'm not sure.
Oh, only `DecodeArrow` uses this.
Will wait for the benchmark results.
I still need to fix the i386 failure.
@github-actions crossbow submit -g cpp
Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit 7a19483. There were 43 benchmark results indicating a performance regression:
The full Conbench report has more details.
The regressions above are because I made data generation more realistic and therefore less trivially compressible:
Also, I've generated a test file here: apache/parquet-testing#46
…XED_LEN_BYTE_ARRAY, INT32 and INT64: force-pushed from 771ad51 to b8bd382
@github-actions crossbow submit -g cpp
@github-actions crossbow submit -g cpp
Revision: 93ebd84. Submitted crossbow builds: ursacomputing/crossbow @ actions-f4c9c5c778
+1
#endif
  switch (width) {
    case 1:
      memcpy(out, raw_values, num_values);
I'm ok with this, but it seems equivalent to PLAIN 🤔?
Well, yes, by definition.
  }
  DoSplitStreams(raw_values, kNumStreams, num_values, dest_streams.data());
}

inline void ByteStreamSplitEncodeScalarDynamic(const uint8_t* raw_values, int width,
                                               const int64_t num_values, uint8_t* out) {
  ::arrow::internal::SmallVector<uint8_t*, 16> dest_streams;
Would this benefit performance?
Potentially, though perhaps not in a micro-benchmark where allocations are reused efficiently.
This LGTM; below are some minor questions.
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit a364e4a. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 62 possible false positives for unstable benchmarks that are known to sometimes produce them.
…amSplitDecoder (#41565)

### Rationale for this change
This problem was raised in #40094. The original bug was fixed in #34140, but that fix was broken again by #40094.

### What changes are included in this PR?
Refine the checking.

### Are these changes tested?
* [x] Will add

### Are there any user-facing changes?
Bugfix

* GitHub Issue: #41562

Authored-by: mwish <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…rite_table() docstring (#41759)

### Rationale for this change
In PR #40094 (issue GH-39978), we forgot to update the `write_table` docstring with an accurate description of the supported data types for BYTE_STREAM_SPLIT.

### Are these changes tested?
No (only a doc change).

### Are there any user-facing changes?
No.

* GitHub Issue: #41748

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
…ders (#15832)

BYTE_STREAM_SPLIT encoding was recently added to cuDF (#15311). The Parquet specification was recently changed (apache/parquet-format#229) to extend the data types that can be encoded as BYTE_STREAM_SPLIT, and this was only recently implemented in Arrow (apache/arrow#40094). This PR adds a check that cuDF and Arrow can produce compatible files using BYTE_STREAM_SPLIT encoding.

Authors:
- Ed Seidl (https://github.com/etseidl)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)

URL: #15832
What changes are included in this PR?
Implement the format addition described in https://issues.apache.org/jira/browse/PARQUET-2414.
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes (additional types supported for Parquet encoding).