
GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 #40094

Merged 3 commits into apache:main on Mar 19, 2024

Conversation

pitrou
Member

@pitrou pitrou commented Feb 15, 2024

What changes are included in this PR?

Implement the format addition described in https://issues.apache.org/jira/browse/PARQUET-2414 .

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes (additional types supported for Parquet encoding).
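For context, BYTE_STREAM_SPLIT scatters byte i of each fixed-width value into stream i, which tends to make each stream more compressible. Below is a minimal sketch of the transform for an arbitrary byte width, following the format description in PARQUET-2414; it is not the actual Arrow implementation, and the function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Encode: byte i of value j goes to position j of stream i.
// Streams are laid out back to back in the output buffer.
std::vector<uint8_t> ByteStreamSplitEncode(const uint8_t* raw, int width,
                                           int64_t num_values) {
  std::vector<uint8_t> out(static_cast<size_t>(width) * num_values);
  for (int64_t j = 0; j < num_values; ++j) {
    for (int i = 0; i < width; ++i) {
      out[i * num_values + j] = raw[j * width + i];
    }
  }
  return out;
}

// Decode: inverse of the above, gathering byte i of value j
// back from stream i.
std::vector<uint8_t> ByteStreamSplitDecode(const uint8_t* enc, int width,
                                           int64_t num_values) {
  std::vector<uint8_t> out(static_cast<size_t>(width) * num_values);
  for (int64_t j = 0; j < num_values; ++j) {
    for (int i = 0; i < width; ++i) {
      out[j * width + i] = enc[i * num_values + j];
    }
  }
  return out;
}
```

For width == 1 the transform is the identity, which is why the single-byte FLBA case discussed below degenerates to a plain copy.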

@pitrou
Member Author

pitrou commented Feb 15, 2024

@github-actions crossbow submit -g cpp


@pitrou pitrou marked this pull request as ready for review February 15, 2024 17:45
@pitrou pitrou requested a review from wgtmac as a code owner February 15, 2024 17:45
@pitrou pitrou requested a review from mapleFU February 15, 2024 17:47
@pitrou
Member Author

pitrou commented Feb 15, 2024

The i386 test failure is related.

void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
int64_t valid_bits_offset) override;
std::shared_ptr<Buffer> FlushValues() override {
if (byte_width_ == 1) {
Member

So only single-byte FLBA would hit this case? I'm OK with the code, but I guess it's rarely exercised.

Member

Same question here, wouldn't it be faster to apply PLAIN encoding for single-byte FLBA type?

Member Author

PlainEncoder<FLBAType> is implemented similarly to this. The only additional step here is ByteStreamSplitEncode, which is skipped for the single-byte FLBA type.

const ColumnDescriptor* descr,
::arrow::MemoryPool* pool = ::arrow::default_memory_pool());
ByteStreamSplitEncoderBase(const ColumnDescriptor* descr, int byte_width,
::arrow::MemoryPool* pool = ::arrow::default_memory_pool())
Member

Since it's a base class, isn't it unnecessary to have a default pool here?

Member Author

Not necessarily.

void SetData(int num_values, const uint8_t* data, int len) override {
if (static_cast<int64_t>(num_values) * byte_width_ != len) {
throw ParquetException(
"Data size does not match number of values in BYTE_STREAM_SPLIT");
Member

Can we print the sizes? That could be helpful for debugging.

Member Author

Will do.
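A sketch of what the improved check could look like, with both sizes included in the message. The function name and exception type here are illustrative stand-ins, not the actual Parquet C++ code (which throws ParquetException).

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Validate that the buffer length matches num_values * byte_width,
// reporting the mismatched sizes to ease debugging.
void CheckByteStreamSplitSize(int num_values, int byte_width, int len) {
  if (static_cast<int64_t>(num_values) * byte_width != len) {
    throw std::invalid_argument(
        "Data size (" + std::to_string(len) +
        ") does not match number of values in BYTE_STREAM_SPLIT (" +
        std::to_string(num_values) + " values * " +
        std::to_string(byte_width) + " bytes)");
  }
}
```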

if (!decode_buffer_ || decode_buffer_->size() < size) {
PARQUET_ASSIGN_OR_THROW(decode_buffer_, ::arrow::AllocateBuffer(size));
const auto alloc_size = ::arrow::bit_util::NextPower2(size);
Member

Hmm, why force NextPower2 here?

Member Author

So that the decode buffer is resized less often when the number of values varies slightly from call to call.
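The growth policy can be illustrated with a hypothetical stand-in for ::arrow::bit_util::NextPower2 (the real helper lives in Arrow's bit utilities; this sketch only shows the idea): rounding the allocation up to the next power of two means a gradually growing buffer is reallocated O(log n) times rather than on every slightly larger request.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ::arrow::bit_util::NextPower2:
// returns the smallest power of two >= n (and 1 for n <= 1).
uint64_t NextPower2(uint64_t n) {
  if (n <= 1) return 1;
  uint64_t p = 1;
  while (p < n) p <<= 1;
  return p;
}
```

With this policy, requests for 1000, 1010, and 1020 bytes all land in the same 1024-byte allocation, so only the first triggers a reallocation.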

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Feb 17, 2024
@mapleFU
Member

mapleFU commented Feb 17, 2024

Sorry for being late; I was on Spring Festival vacation. This generally LGTM, just some minor comments. Also, do we have benchmark data for this change?

cpp/src/arrow/util/byte_stream_split_test.cc
cpp/src/parquet/encoding.cc (outdated)
cpp/src/parquet/encoding.cc (outdated)
@pitrou
Member Author

pitrou commented Feb 19, 2024

@github-actions crossbow submit -g cpp

@pitrou
Member Author

pitrou commented Feb 19, 2024

@ursabot please benchmark

@ursabot

ursabot commented Feb 19, 2024

Benchmark runs are scheduled for commit 7a19483. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.


if (!decode_buffer_ || decode_buffer_->size() < size) {
PARQUET_ASSIGN_OR_THROW(decode_buffer_, ::arrow::AllocateBuffer(size));
const auto alloc_size = ::arrow::bit_util::NextPower2(size);
PARQUET_ASSIGN_OR_THROW(decode_buffer_, ::arrow::AllocateBuffer(alloc_size));
Member

Just noticed that this doesn't use a memory pool. Should we have one here?

Member Author

This is a temporary buffer, so I'm not sure.

Member

Oh, only DecodeArrow uses this.

Member

@mapleFU mapleFU left a comment

Will wait for the benchmark results.

@pitrou
Member Author

pitrou commented Feb 19, 2024

I still need to fix the i386 failure.

@pitrou
Member Author

pitrou commented Feb 19, 2024

@github-actions crossbow submit -g cpp


@pitrou pitrou changed the title GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 DO NOT MERGE: GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 Feb 19, 2024

Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit 7a19483.

There were 43 benchmark results indicating a performance regression:

The full Conbench report has more details.

@pitrou
Member Author

pitrou commented Feb 19, 2024

The regressions above are because I made data generation more realistic and therefore less trivially compressible:
https://github.com/apache/arrow/pull/40094/files#diff-e0338b1864949c8001485874b6249a5a756c83c9671e4a31ee523289035487e5R171

@pitrou
Member Author

pitrou commented Feb 29, 2024

Also, I've generated a test file here: apache/parquet-testing#46

@pitrou pitrou changed the title DO NOT MERGE: GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 Mar 18, 2024
@pitrou
Member Author

pitrou commented Mar 18, 2024

@github-actions crossbow submit -g cpp


@pitrou
Member Author

pitrou commented Mar 18, 2024

@github-actions crossbow submit -g cpp


Revision: 93ebd84

Submitted crossbow builds: ursacomputing/crossbow @ actions-f4c9c5c778

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

@pitrou pitrou requested a review from wgtmac March 18, 2024 15:58
Member

@wgtmac wgtmac left a comment

+1

#endif
switch (width) {
case 1:
memcpy(out, raw_values, num_values);
Member

I'm OK with this, but it seems equivalent to PLAIN 🤔?

Member Author

Well, yes, by definition.

}
DoSplitStreams(raw_values, kNumStreams, num_values, dest_streams.data());
}

inline void ByteStreamSplitEncodeScalarDynamic(const uint8_t* raw_values, int width,
const int64_t num_values, uint8_t* out) {
::arrow::internal::SmallVector<uint8_t*, 16> dest_streams;
Member

Would this benefit performance?

Member Author

Potentially, though perhaps not in a micro-benchmark where allocations are reused efficiently.

Member

@mapleFU mapleFU left a comment

LGTM; below are some minor questions.

@pitrou pitrou merged commit a364e4a into apache:main Mar 19, 2024
42 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Mar 19, 2024
@pitrou pitrou deleted the gh39978-byte-stream-split branch March 19, 2024 11:07

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit a364e4a.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 62 possible false positives for unstable benchmarks that are known to sometimes produce them.

pitrou pushed a commit that referenced this pull request May 7, 2024
…amSplitDecoder (#41565)

### Rationale for this change

This problem was introduced in #40094. The original bug was fixed in #34140, but the fix was broken again by #40094.

### What changes are included in this PR?

Refine checking

### Are these changes tested?

* [x] Will add

### Are there any user-facing changes?

Bugfix

* GitHub Issue: #41562

Authored-by: mwish <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
raulcd pushed a commit that referenced this pull request May 8, 2024
…amSplitDecoder (#41565)
jorisvandenbossche pushed a commit that referenced this pull request May 22, 2024
…rite_table() docstring (#41759)

### Rationale for this change

In PR #40094 (issue GH-39978), we forgot to update the `write_table` docstring with an accurate description of the supported data types for BYTE_STREAM_SPLIT.

### Are these changes tested?

No (only a doc change).

### Are there any user-facing changes?

No.
* GitHub Issue: #41748

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…teStreamSplitDecoder (apache#41565)
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…n in write_table() docstring (apache#41759)
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Jun 24, 2024
…ders (#15832)

BYTE_STREAM_SPLIT encoding was recently added to cuDF (#15311). The Parquet specification was recently changed (apache/parquet-format#229) to extend the datatypes that can be encoded as BYTE_STREAM_SPLIT, and this was only recently implemented in arrow (apache/arrow#40094). This PR adds a check that cuDF and arrow can produce compatible files using BYTE_STREAM_SPLIT encoding.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #15832
4 participants