
GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 #40094

Merged 3 commits into apache:main on Mar 19, 2024

Conversation

pitrou
Member

@pitrou pitrou commented Feb 15, 2024

What changes are included in this PR?

Implement the format addition described in https://issues.apache.org/jira/browse/PARQUET-2414 .

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes (additional types supported for Parquet encoding).
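For context, BYTE_STREAM_SPLIT scatters byte i of each fixed-width value into stream i, which tends to make each stream more compressible. Below is a minimal sketch of the transform for an arbitrary byte width, following the format description in PARQUET-2414; it is not the actual Arrow implementation, and the function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Encode: byte i of value j goes to position j of stream i.
// Streams are laid out back to back in the output buffer.
std::vector<uint8_t> ByteStreamSplitEncode(const uint8_t* raw, int width,
                                           int64_t num_values) {
  std::vector<uint8_t> out(static_cast<size_t>(width) * num_values);
  for (int64_t j = 0; j < num_values; ++j) {
    for (int i = 0; i < width; ++i) {
      out[i * num_values + j] = raw[j * width + i];
    }
  }
  return out;
}

// Decode: inverse of the above, gathering byte i of value j
// back from stream i.
std::vector<uint8_t> ByteStreamSplitDecode(const uint8_t* enc, int width,
                                           int64_t num_values) {
  std::vector<uint8_t> out(static_cast<size_t>(width) * num_values);
  for (int64_t j = 0; j < num_values; ++j) {
    for (int i = 0; i < width; ++i) {
      out[j * width + i] = enc[i * num_values + j];
    }
  }
  return out;
}
```

For width == 1 the transform is the identity, which is why the single-byte FLBA case discussed below degenerates to a plain copy.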

@pitrou
Member Author

pitrou commented Feb 15, 2024

@github-actions crossbow submit -g cpp


@pitrou pitrou marked this pull request as ready for review February 15, 2024 17:45
@pitrou pitrou requested a review from wgtmac as a code owner February 15, 2024 17:45
@pitrou pitrou requested a review from mapleFU February 15, 2024 17:47
@pitrou
Member Author

pitrou commented Feb 15, 2024

The i386 test failure is related.

void PutSpaced(const T* src, int num_values, const uint8_t* valid_bits,
int64_t valid_bits_offset) override;
std::shared_ptr<Buffer> FlushValues() override {
if (byte_width_ == 1) {
Member

So only single-byte FLBA would hit this case? I'm OK with the code, but I guess it's rarely exercised.

Member

Same question here, wouldn't it be faster to apply PLAIN encoding for single-byte FLBA type?

Member Author

PlainEncoder<FLBAType> is implemented similarly to this. The only additional step here is ByteStreamSplitEncode, which is skipped for the single-byte FLBA type.

const ColumnDescriptor* descr,
::arrow::MemoryPool* pool = ::arrow::default_memory_pool());
ByteStreamSplitEncoderBase(const ColumnDescriptor* descr, int byte_width,
::arrow::MemoryPool* pool = ::arrow::default_memory_pool())
Member

Since it's a base class, isn't it unnecessary to have a default pool here?

Member Author

Not necessarily.

void SetData(int num_values, const uint8_t* data, int len) override {
if (static_cast<int64_t>(num_values) * byte_width_ != len) {
throw ParquetException(
"Data size does not match number of values in BYTE_STREAM_SPLIT");
Member

Can we print the sizes? That could be helpful for debugging.

Member Author

Will do.
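A sketch of what the improved check could look like, with both sizes included in the message. The function name and exception type here are illustrative stand-ins, not the actual Parquet C++ code (which throws ParquetException).

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Validate that the buffer length matches num_values * byte_width,
// reporting the mismatched sizes to ease debugging.
void CheckByteStreamSplitSize(int num_values, int byte_width, int len) {
  if (static_cast<int64_t>(num_values) * byte_width != len) {
    throw std::invalid_argument(
        "Data size (" + std::to_string(len) +
        ") does not match number of values in BYTE_STREAM_SPLIT (" +
        std::to_string(num_values) + " values * " +
        std::to_string(byte_width) + " bytes)");
  }
}
```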

if (!decode_buffer_ || decode_buffer_->size() < size) {
PARQUET_ASSIGN_OR_THROW(decode_buffer_, ::arrow::AllocateBuffer(size));
const auto alloc_size = ::arrow::bit_util::NextPower2(size);
Member

Hmm, why force NextPower2 here?

Member Author

So that the decode buffer is resized less often when the number of values varies slightly from call to call.
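The growth policy can be illustrated with a hypothetical stand-in for ::arrow::bit_util::NextPower2 (the real helper lives in Arrow's bit utilities; this sketch only shows the idea): rounding the allocation up to the next power of two means a gradually growing buffer is reallocated O(log n) times rather than on every slightly larger request.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ::arrow::bit_util::NextPower2:
// returns the smallest power of two >= n (and 1 for n <= 1).
uint64_t NextPower2(uint64_t n) {
  if (n <= 1) return 1;
  uint64_t p = 1;
  while (p < n) p <<= 1;
  return p;
}
```

With this policy, requests for 1000, 1010, and 1020 bytes all land in the same 1024-byte allocation, so only the first triggers a reallocation.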

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Feb 17, 2024
@mapleFU
Member

mapleFU commented Feb 17, 2024

Sorry for being late; I was on Spring Festival vacation. This generally LGTM, just some minor comments. Also, do we have benchmark data for this change?

cpp/src/arrow/util/byte_stream_split_test.cc
cpp/src/parquet/encoding.cc (outdated)
cpp/src/parquet/encoding.cc (outdated)
@pitrou
Member Author

pitrou commented Feb 19, 2024

@github-actions crossbow submit -g cpp

@pitrou
Member Author

pitrou commented Feb 19, 2024

@ursabot please benchmark

@ursabot

ursabot commented Feb 19, 2024

Benchmark runs are scheduled for commit 7a19483. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.


if (!decode_buffer_ || decode_buffer_->size() < size) {
PARQUET_ASSIGN_OR_THROW(decode_buffer_, ::arrow::AllocateBuffer(size));
const auto alloc_size = ::arrow::bit_util::NextPower2(size);
PARQUET_ASSIGN_OR_THROW(decode_buffer_, ::arrow::AllocateBuffer(alloc_size));
Member

Just noticed that this doesn't use a memory pool. Should we have one here?

Member Author

This is a temporary buffer, so I'm not sure.

Member

Oh, only DecodeArrow uses this.

Member

@mapleFU mapleFU left a comment

Will wait for the benchmark results.

@pitrou
Member Author

pitrou commented Feb 19, 2024

I still need to fix the i386 failure.

@pitrou
Member Author

pitrou commented Feb 19, 2024

@github-actions crossbow submit -g cpp


@pitrou pitrou changed the title GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 DO NOT MERGE: GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 Feb 19, 2024

Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit 7a19483.

There were 43 benchmark results indicating a performance regression:

The full Conbench report has more details.

@pitrou
Member Author

pitrou commented Feb 19, 2024

The regressions above are because I made data generation more realistic and therefore less trivially compressible:
https://github.com/apache/arrow/pull/40094/files#diff-e0338b1864949c8001485874b6249a5a756c83c9671e4a31ee523289035487e5R171

@pitrou
Member Author

pitrou commented Feb 29, 2024

Also, I've generated a test file here: apache/parquet-testing#46

@pitrou pitrou changed the title DO NOT MERGE: GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 GH-39978: [C++][Parquet] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 Mar 18, 2024
@pitrou
Member Author

pitrou commented Mar 18, 2024

@github-actions crossbow submit -g cpp


@pitrou
Member Author

pitrou commented Mar 18, 2024

@github-actions crossbow submit -g cpp


Revision: 93ebd84

Submitted crossbow builds: ursacomputing/crossbow @ actions-f4c9c5c778

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

@pitrou pitrou requested a review from wgtmac March 18, 2024 15:58
Member

@wgtmac wgtmac left a comment

+1

#endif
switch (width) {
case 1:
memcpy(out, raw_values, num_values);
Member

I'm OK with this, but it seems equivalent to PLAIN 🤔?

Member Author

Well, yes, by definition.

}
DoSplitStreams(raw_values, kNumStreams, num_values, dest_streams.data());
}

inline void ByteStreamSplitEncodeScalarDynamic(const uint8_t* raw_values, int width,
const int64_t num_values, uint8_t* out) {
::arrow::internal::SmallVector<uint8_t*, 16> dest_streams;
Member

Would this benefit performance?

Member Author

Potentially, though perhaps not in a micro-benchmark where allocations are reused efficiently.

Member

@mapleFU mapleFU left a comment

LGTM; below are some minor questions.

@pitrou pitrou merged commit a364e4a into apache:main Mar 19, 2024
42 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Mar 19, 2024
@pitrou pitrou deleted the gh39978-byte-stream-split branch March 19, 2024 11:07

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit a364e4a.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 62 possible false positives for unstable benchmarks that are known to sometimes produce them.

pitrou pushed a commit that referenced this pull request May 7, 2024
…amSplitDecoder (#41565)

### Rationale for this change

This problem was introduced in #40094. The original bug was fixed in #34140, but the fix was broken again by #40094.

### What changes are included in this PR?

Refine checking

### Are these changes tested?

* [x] Will add

### Are there any user-facing changes?

Bugfix

* GitHub Issue: #41562

Authored-by: mwish <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
raulcd pushed a commit that referenced this pull request May 8, 2024
…amSplitDecoder (#41565)
jorisvandenbossche pushed a commit that referenced this pull request May 22, 2024
…rite_table() docstring (#41759)

### Rationale for this change

In PR #40094 (issue GH-39978), we forgot to update the `write_table` docstring with an accurate description of the supported data types for BYTE_STREAM_SPLIT.

### Are these changes tested?

No (only a doc change).

### Are there any user-facing changes?

No.
* GitHub Issue: #41748

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…teStreamSplitDecoder (apache#41565)
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…n in write_table() docstring (apache#41759)
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Jun 24, 2024
…ders (#15832)

BYTE_STREAM_SPLIT encoding was recently added to cuDF (#15311). The Parquet specification was recently changed (apache/parquet-format#229) to extend the datatypes that can be encoded as BYTE_STREAM_SPLIT, and this was only recently implemented in arrow (apache/arrow#40094). This PR adds a check that cuDF and arrow can produce compatible files using BYTE_STREAM_SPLIT encoding.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #15832
4 participants