-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-36939: [C++][Parquet] Direct put of BooleanArray is incorrect when called several times #36972
GH-36939: [C++][Parquet] Direct put of BooleanArray is incorrect when called several times #36972
Conversation
|
587e2e0
to
c0cd7e0
Compare
85749de
to
1765718
Compare
@pitrou @emkornfield Do you want to take a look? |
cpp/src/parquet/encoding.cc
Outdated
@@ -365,7 +369,7 @@ class PlainEncoder<BooleanType> : public EncoderImpl, virtual public BooleanEnco | |||
} | |||
writer.Finish(); | |||
} | |||
sink_.UnsafeAdvance(data.length()); | |||
sink_.UnsafeAdvance(data.length() - data.null_count()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be n_valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield Parquet encoding stores "valid" value. The invalid value will be marked in rep-levels and def-levels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to the entire value passed into sink_.UnsafeAdvance
This seems like it was incorrect even for all present values because we are advancing the Byte buffer by number of values (in this case these would be number of bits) and not number bytes. So we would be overadvancing in both cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(i.e. we seem to be advancing by more bytes then are being reserved in both cases)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the logic here is confusing, but it's right after my change. When input k boolean values.
sink_.Reserve
will only reserve bytes for bits (k).sink_.UnsafeAdvance
will advance k bytes.
However, when used, sink_.length()
will only be regarded as bits. So (2) has a bug, but it works here...
I'd like to fix the bug first, and take time to optimize the code later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case we need to be reserving k bytes and not k/8. I think this wasn't caught sooner because columnwriter appears to have a specialization for Boolean values that bypasses this method (e.g. I don't think anything but encoder tests will fail if you put throw exception(....)
in this method in general.
General question, is this code path actually used in practice? (should we just delete the code)? |
|
cpp/src/parquet/encoding.cc
Outdated
@@ -354,6 +357,7 @@ class PlainEncoder<BooleanType> : public EncoderImpl, virtual public BooleanEnco | |||
sink_.length(), n_valid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line also seems problematic if this method is called multiple times with in a row with boolean arrays (not sure actual code does this though).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#36972 (comment)
You're right. This would not produce bug if PutArrow is not mixed with PutImpl, but will make Boolean leaves a larger space than expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mentioned this below, using sink_.length() to store the actual values seems wrong.
Yes, I think we got lucky here because I think the python/c++ column writer code calls PutSpaced |
Thats lucky! Maybe I'll just fix like this and add some comments |
Thanks for your patience. Conbench analyzed the 3 benchmarking runs that have been run so far on PR commit 1a45056. None of the specified runs had any associated benchmark results. The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
^ Sorry, this is a mistake. There were C++ benchmark results, but they were suppressed because we typically only report on Python and R benchmark results. You can see the results by clicking through to the report. I'll fix this bug soon. |
I'd rather not suffer a potentially significant performance regression here. Ideally, writing a boolean Arrow array would go through this function (I'm not sure why it doesn't). |
Sure I'll rewrite it :-) But I wonder that how can we optimize |
For now, yes, but it could perhaps be improved if it's often used. |
I've changed to |
I've add an |
Again, why not use |
Because |
arrow/cpp/src/arrow/buffer_builder.h Lines 377 to 383 in f93929f
|
Okay, I've refactor all into a I change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice simplification :-)
@github-actions crossbow submit -g nightly-tests |
|
@github-actions crossbow submit -g nightly-tests |
Revision: 8c23944 Submitted crossbow builds: ursacomputing/crossbow @ actions-a101b56646 |
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit be1b003. There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
…t when called several times (apache#36972) ### Rationale for this change This is from a bug in PLAIN encoding with `BooleanArray` input. Boolean will introduce bad length when writing arrow data. This interface is not widely used. ### What changes are included in this PR? Rewrite PLAIN boolean encoder to use `TypedBufferBuilder` instead of an incorrect hand-baked implementation. ### Are these changes tested? Yes ### Are there any user-facing changes? No. * Closes: apache#36939 Lead-authored-by: mwish <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
This is from a bug in PLAIN encoding with
BooleanArray
input. Boolean will introduce bad length when writing arrow data.This interface is not widely used.
What changes are included in this PR?
Rewrite PLAIN boolean encoder to use
TypedBufferBuilder
instead of an incorrect hand-baked implementation.Are these changes tested?
Yes
Are there any user-facing changes?
No.