GH-36939: [C++][Parquet] Direct put of BooleanArray is incorrect when called several times #36972

Merged: 13 commits merged into apache:main on Aug 10, 2023

mapleFU
Member

@mapleFU mapleFU commented Aug 1, 2023

Rationale for this change

This fixes a bug in PLAIN encoding with BooleanArray input: the boolean encoder records an incorrect length when writing Arrow data.

This interface is not widely used.

What changes are included in this PR?

Rewrite the PLAIN boolean encoder to use TypedBufferBuilder instead of an incorrect hand-rolled implementation.

Are these changes tested?

Yes

Are there any user-facing changes?

No.

@mapleFU mapleFU requested a review from wgtmac as a code owner August 1, 2023 12:25
@mapleFU mapleFU requested a review from pitrou August 1, 2023 12:25
@mapleFU
Member Author

mapleFU commented Aug 1, 2023

@pitrou @wgtmac PTAL

@github-actions

github-actions bot commented Aug 1, 2023

⚠️ GitHub issue #36939 has been automatically assigned in GitHub to PR creator.

@mapleFU mapleFU force-pushed the parquet/check-inconsistent-boolean branch from 587e2e0 to c0cd7e0 Compare August 1, 2023 13:00
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 1, 2023
@mapleFU mapleFU force-pushed the parquet/check-inconsistent-boolean branch from 85749de to 1765718 Compare August 2, 2023 03:41
@wgtmac
Member

wgtmac commented Aug 4, 2023

@pitrou @emkornfield Do you want to take a look?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Aug 6, 2023
@@ -365,7 +369,7 @@ class PlainEncoder<BooleanType> : public EncoderImpl, virtual public BooleanEnco
       }
       writer.Finish();
     }
-    sink_.UnsafeAdvance(data.length());
+    sink_.UnsafeAdvance(data.length() - data.null_count());
Contributor

shouldn't this be n_valid?

Member Author

@emkornfield Parquet encodings store only the "valid" values. The invalid (null) values are marked in the rep-levels and def-levels.
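
To make the point about valid values concrete, here is a minimal standalone sketch. It assumes a flat optional boolean column with a maximum definition level of 1, and `EncodedBooleans` and `EncodeNullableBooleans` are hypothetical names used only for illustration, not Arrow or parquet-cpp APIs. It shows why the value stream grows by the number of non-null entries rather than by the array length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper types, not part of Arrow or parquet-cpp.
// Assumption: flat optional column, so the max definition level is 1.
struct EncodedBooleans {
  std::vector<uint8_t> packed_values;  // PLAIN booleans: 1 bit per non-null value, LSB first
  std::vector<int16_t> def_levels;     // 1 = value present, 0 = null
  int64_t num_valid = 0;
};

inline EncodedBooleans EncodeNullableBooleans(const std::vector<bool>& values,
                                              const std::vector<bool>& is_valid) {
  EncodedBooleans out;
  for (std::size_t i = 0; i < values.size(); ++i) {
    out.def_levels.push_back(is_valid[i] ? 1 : 0);
    if (!is_valid[i]) continue;  // nulls take no space in the value stream
    // Start a new zeroed byte every 8 valid values.
    if (out.num_valid % 8 == 0) out.packed_values.push_back(0);
    if (values[i]) {
      out.packed_values.back() |= static_cast<uint8_t>(1u << (out.num_valid % 8));
    }
    ++out.num_valid;
  }
  return out;
}
```

With four input slots of which one is null, the sketch emits four definition levels but only three value bits, which is the `data.length() - data.null_count()` quantity in the diff above.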

Contributor

@emkornfield emkornfield Aug 6, 2023

I was referring to the entire value passed into sink_.UnsafeAdvance

This seems like it was incorrect even when all values are present, because we are advancing the byte buffer by the number of values (in this case, the number of bits) and not the number of bytes. So we would be over-advancing in both cases?

Contributor

(i.e. we seem to be advancing by more bytes than are being reserved in both cases)

Member Author

Yes, the logic here is confusing, but it's correct after my change. Given k input boolean values:

  1. sink_.Reserve will only reserve the bytes needed for k bits.
  2. sink_.UnsafeAdvance will advance by k bytes.

However, when it is read back, sink_.length() is interpreted as a number of bits. So (2) has a bug, but it happens to work here...

I'd like to fix the bug first, and take time to optimize the code later.
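
The mismatch between the two steps above can be shown with a small standalone sketch. `SinkAccounting`, `OldAccounting` and `ByteAccounting` are hypothetical names for illustration only, and `BytesForBits` here just mirrors the arithmetic of `arrow::bit_util::BytesForBits` so the snippet has no Arrow dependency. It models the bookkeeping, not the real `sink_` object.

```cpp
#include <cassert>
#include <cstdint>

// Same rounding-up arithmetic as arrow::bit_util::BytesForBits, re-implemented
// locally so this sketch compiles on its own.
constexpr int64_t BytesForBits(int64_t bits) { return (bits + 7) / 8; }

// Hypothetical model of the accounting for k boolean values.
struct SinkAccounting {
  int64_t reserved_bytes;  // room made by Reserve()
  int64_t advanced_bytes;  // length claimed by UnsafeAdvance()
};

// Steps (1) and (2) as described in the comment: reserve bytes for k bits,
// then advance by k.
constexpr SinkAccounting OldAccounting(int64_t k) {
  return SinkAccounting{BytesForBits(k), k};
}

// One consistent alternative (an assumption, not the merged fix): count both
// quantities in bytes.
constexpr SinkAccounting ByteAccounting(int64_t k) {
  return SinkAccounting{BytesForBits(k), BytesForBits(k)};
}
```

For k = 100 the old accounting reserves 13 bytes but advances the length by 100, so the recorded length only makes sense if every reader treats it as a bit count.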

Contributor

@emkornfield emkornfield Aug 6, 2023

In that case we need to be reserving k bytes and not k/8. I think this wasn't caught sooner because the column writer appears to have a specialization for Boolean values that bypasses this method (e.g. I don't think anything but the encoder tests would fail if you put throw exception(....) in this method).

@emkornfield
Contributor

General question, is this code path actually used in practice? (should we just delete the code)?

@mapleFU
Member Author

mapleFU commented Aug 6, 2023

General question, is this code path actually used in practice? (should we just delete the code)?

We use the Parquet boolean type to store bool values, and this has been exposed in Python for a long time. So I think:

  1. The Boolean type is not widely used, otherwise this would have been found earlier.
  2. However, I think some people are using it.

So I think we'd better fix this.

@@ -354,6 +357,7 @@ class PlainEncoder<BooleanType> : public EncoderImpl, virtual public BooleanEnco
sink_.length(), n_valid);
Contributor

This line also seems problematic if this method is called multiple times in a row with boolean arrays (not sure the actual code does this, though).

Member Author

#36972 (comment)
You're right. This would not produce a bug as long as PutArrow is not mixed with PutImpl, but it makes the Boolean data take up more space than expected.
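
As a rough illustration of the repeated-call hazard discussed in this thread, the sketch below compares a bit-granular cursor with one that only tracks whole bytes. `BitAppender` is a hypothetical class written for this example; it is not the parquet-cpp encoder and does not reproduce the exact code path under review.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical appender, not the parquet-cpp encoder: packs booleans LSB first.
class BitAppender {
 public:
  // align_each_call = true models a cursor that only knows whole bytes, so
  // every call starts at the next byte boundary.
  explicit BitAppender(bool align_each_call) : align_each_call_(align_each_call) {}

  void Put(const std::vector<bool>& values) {
    if (align_each_call_) {
      bit_length_ = (bit_length_ + 7) / 8 * 8;  // skips over padding bits
    }
    for (bool v : values) {
      // Grow by one zeroed byte whenever the cursor enters a new byte.
      if (bit_length_ / 8 == static_cast<int64_t>(bytes_.size())) bytes_.push_back(0);
      if (v) bytes_[bit_length_ / 8] |= static_cast<uint8_t>(1u << (bit_length_ % 8));
      ++bit_length_;
    }
  }

  const std::vector<uint8_t>& bytes() const { return bytes_; }
  int64_t bit_length() const { return bit_length_; }

 private:
  bool align_each_call_;
  std::vector<uint8_t> bytes_;
  int64_t bit_length_ = 0;
};
```

Two calls with three values each fit in one byte when the cursor is bit-granular, but occupy two bytes with five padding bits in the middle when each call restarts at a byte boundary. A reader expecting a dense bit stream would then decode the padding as values.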

Contributor

As you mentioned below, using sink_.length() to store the actual values seems wrong.

@emkornfield
Contributor

emkornfield commented Aug 6, 2023

We use parquet boolean type to store bool value. And this is exposed in python for a long time. So I think:

Yes, I think we got lucky here because I think the python/c++ column writer code calls PutSpaced

@mapleFU
Member Author

mapleFU commented Aug 6, 2023

Python/c++ column writer code calls PutSpaced

That's lucky! Maybe I'll just fix it like this and add some comments.

@github-actions github-actions bot removed the awaiting changes Awaiting changes label Aug 6, 2023
@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 3 benchmarking runs that have been run so far on PR commit 1a45056.

None of the specified runs had any associated benchmark results.

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

@austin3dickey
Contributor

austin3dickey commented Aug 9, 2023

^ Sorry, this is a mistake. There were C++ benchmark results, but they were suppressed because we typically only report on Python and R benchmark results. You can see the results by clicking through to the report. I'll fix this bug soon.

@pitrou
Member

pitrou commented Aug 9, 2023

I'd rather not suffer a potentially significant performance regression here. Ideally, writing a boolean Arrow array would go through this function (I'm not sure why it doesn't).

@mapleFU
Member Author

mapleFU commented Aug 9, 2023

I'd rather not suffer a potentially significant performance regression here. Ideally, writing a boolean Arrow array would go through this function (I'm not sure why it doesn't).

Sure I'll rewrite it :-)

But I wonder how we can optimize PutSpaced. Should we just leave it as before?

@pitrou
Member

pitrou commented Aug 9, 2023

But I wonder how we can optimize PutSpaced. Should we just leave it as before?

For now, yes, but it could perhaps be improved if it's often used.

@mapleFU
Member Author

mapleFU commented Aug 10, 2023

I've changed it to use bit_writer_.PutValue to try to make the non-spaced Put faster. It would still be slower than CopyBitmap, but I think it's correct...

@mapleFU
Member Author

mapleFU commented Aug 10, 2023

I've added a valid_bit_length_ to replace sink.size(). @wgtmac @pitrou @emkornfield

@pitrou
Member

pitrou commented Aug 10, 2023

Again, why not use TypedBufferBuilder<bool>?

@mapleFU
Member Author

mapleFU commented Aug 10, 2023

Again, why not use TypedBufferBuilder<bool>?

Because PutImpl already uses sink. And I think TypedBufferBuilder<bool> is as slow as the current implementation.

@pitrou
Member

pitrou commented Aug 10, 2023

TypedBufferBuilder<bool> uses CopyBitmap internally, so it should be faster. It should also help automate some of the bookkeeping you're doing by hand.

/// \brief Append bits from a packed bitmap
void UnsafeAppend(const uint8_t* bitmap, int64_t offset, int64_t num_elements) {
  if (num_elements == 0) return;
  internal::CopyBitmap(bitmap, offset, num_elements, mutable_data(), bit_length_);
  false_count_ += num_elements - internal::CountSetBits(bitmap, offset, num_elements);
  bit_length_ += num_elements;
}
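
For readers without the Arrow sources at hand, the bookkeeping in that snippet can be modeled in a few standalone lines. `BoolBufferModel` is a hypothetical stand-in written for this example, not `arrow::TypedBufferBuilder<bool>` itself, and it copies bits with a naive loop where the real builder calls `CopyBitmap`. The part worth noticing is that the destination bit offset and the false count persist across calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a boolean buffer builder. Only the bookkeeping is
// modeled; the bit copy is a simple loop, not Arrow's CopyBitmap.
class BoolBufferModel {
 public:
  // Appends num_elements bits read from `bitmap` starting at bit `offset`.
  void Append(const uint8_t* bitmap, int64_t offset, int64_t num_elements) {
    if (num_elements == 0) return;
    // Grow to hold all bits; new bytes are zero-initialized.
    bytes_.resize(static_cast<std::size_t>((bit_length_ + num_elements + 7) / 8), 0);
    for (int64_t i = 0; i < num_elements; ++i) {
      const int64_t src = offset + i;
      const bool bit = ((bitmap[src / 8] >> (src % 8)) & 1) != 0;
      if (bit) {
        bytes_[bit_length_ / 8] |= static_cast<uint8_t>(1u << (bit_length_ % 8));
      } else {
        ++false_count_;
      }
      ++bit_length_;  // destination cursor is kept in bits across calls
    }
  }

  const std::vector<uint8_t>& bytes() const { return bytes_; }
  int64_t bit_length() const { return bit_length_; }
  int64_t false_count() const { return false_count_; }

 private:
  std::vector<uint8_t> bytes_;
  int64_t bit_length_ = 0;
  int64_t false_count_ = 0;
};
```

Appending two three-bit slices of the same source bitmap lands them back to back in a single output byte, with no hand-written offset arithmetic in the caller.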

@mapleFU
Member Author

mapleFU commented Aug 10, 2023

Okay, I've refactored everything to use a TypedBufferBuilder @pitrou

I changed PutImpl to use TypedBufferBuilder as well. I guess we can optimize it later.

Member

@pitrou pitrou left a comment

This is a nice simplification :-)

@pitrou
Member

pitrou commented Aug 10, 2023

@github-actions crossbow submit -g nightly-tests

@github-actions

500 [No message]
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/5823722160

@pitrou
Member

pitrou commented Aug 10, 2023

@github-actions crossbow submit -g nightly-tests

@github-actions

Revision: 8c23944

Submitted crossbow builds: ursacomputing/crossbow @ actions-a101b56646

Task Status
example-cpp-minimal-build-static Github Actions
example-cpp-minimal-build-static-system-dependency Github Actions
example-python-minimal-build-fedora-conda Github Actions
example-python-minimal-build-ubuntu-venv Github Actions
test-alpine-linux-cpp Github Actions
test-build-cpp-fuzz Github Actions
test-build-vcpkg-win Github Actions
test-conda-cpp Github Actions
test-conda-cpp-valgrind Azure
test-conda-python-3.10 Github Actions
test-conda-python-3.10-hdfs-2.9.2 Github Actions
test-conda-python-3.10-hdfs-3.2.1 Github Actions
test-conda-python-3.10-pandas-latest Github Actions
test-conda-python-3.10-pandas-nightly Github Actions
test-conda-python-3.10-spark-v3.4.1 Github Actions
test-conda-python-3.10-substrait Github Actions
test-conda-python-3.11 Github Actions
test-conda-python-3.11-dask-latest Github Actions
test-conda-python-3.11-dask-upstream_devel Github Actions
test-conda-python-3.11-hypothesis Github Actions
test-conda-python-3.11-pandas-upstream_devel Github Actions
test-conda-python-3.11-spark-master Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-pandas-1.0 Github Actions
test-conda-python-3.8-spark-v3.4.1 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-pandas-latest Github Actions
test-cuda-cpp Github Actions
test-cuda-python Github Actions
test-debian-11-cpp-amd64 Github Actions
test-debian-11-cpp-i386 Github Actions
test-debian-11-go-1.17 Azure
test-debian-11-go-1.20 Azure
test-debian-11-python-3 Azure
test-debian-c-glib Github Actions
test-debian-ruby Github Actions
test-fedora-35-cpp Github Actions
test-fedora-35-python-3 Azure
test-fedora-r-clang-sanitizer Azure
test-r-arrow-backwards-compatibility Github Actions
test-r-depsource-bundled Azure
test-r-depsource-system Github Actions
test-r-dev-duckdb Github Actions
test-r-devdocs Github Actions
test-r-gcc-11 Github Actions
test-r-gcc-12 Github Actions
test-r-install-local Github Actions
test-r-install-local-minsizerel Github Actions
test-r-library-r-base-latest Azure
test-r-linux-as-cran Github Actions
test-r-linux-rchk Github Actions
test-r-linux-valgrind Azure
test-r-minimal-build Azure
test-r-offline-maximal Github Actions
test-r-offline-minimal Azure
test-r-rhub-debian-gcc-devel-lto-latest Azure
test-r-rhub-debian-gcc-release-custom-ccache Azure
test-r-rhub-ubuntu-gcc-release-latest Azure
test-r-rstudio-r-base-4.1-opensuse153 Azure
test-r-rstudio-r-base-4.2-centos7-devtoolset-8 Azure
test-r-rstudio-r-base-4.2-focal Azure
test-r-ubuntu-22.04 Github Actions
test-r-versions Github Actions
test-skyhook-integration Github Actions
test-ubuntu-20.04-cpp Github Actions
test-ubuntu-20.04-cpp-bundled Github Actions
test-ubuntu-20.04-cpp-minimal-with-formats Github Actions
test-ubuntu-20.04-cpp-thread-sanitizer Github Actions
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-cpp Github Actions
test-ubuntu-22.04-cpp-20 Github Actions
test-ubuntu-22.04-docs Github Actions
test-ubuntu-22.04-python-3 Github Actions
test-ubuntu-c-glib Github Actions
test-ubuntu-r-sanitizer Azure
test-ubuntu-ruby Github Actions

@pitrou pitrou changed the title GH-36939: [C++][Parquet] Boolean encoding has inconsistent implemention GH-36939: [C++][Parquet] Direct put of BooleanArray is incorrect when called several times Aug 10, 2023
@pitrou pitrou merged commit be1b003 into apache:main Aug 10, 2023
30 of 32 checks passed
@pitrou pitrou removed the awaiting change review Awaiting change review label Aug 10, 2023
@mapleFU mapleFU deleted the parquet/check-inconsistent-boolean branch August 11, 2023 02:11
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit be1b003.

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…t when called several times (apache#36972)

* Closes: apache#36939

Lead-authored-by: mwish <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Successfully merging this pull request may close these issues.

[C++][Parquet] Boolean encoding has inconsistent implemention
6 participants