-
Notifications
You must be signed in to change notification settings - Fork 889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix reading of RLE encoded boolean data from parquet files with V2 page headers #13707
Fix reading of RLE encoded boolean data from parquet files with V2 page headers #13707
Conversation
Pull requests from external contributors require approval from a |
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, appreciate that the code duplication was avoided.
A few mildly confused comments below :)
/ok to test |
/ok to test |
/ok to test |
} else { | ||
init_rle(cur, cur + len); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it always going to be safe to do this? That is, is it always the case that for V2 headers, there will be a valid varint, regardless of whether level_bits
is 0 or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
init_rle is only called if the number of bytes for the level is non-zero (len != 0
). So at this point there should be some bytes to read. The problem file actually only encodes the RLE length, but not the RLE value (which is then assumed to be 0). That's why the test for cur < end
in init_rle
FWIW I'm planning to refactor the def/rep_lvl_bytes
into an array to match other bits of the PageInfo
struct. Then this bit of logic gets a little clearer.
// first 4 bytes are length of RLE data | ||
int const len = (cur[0]) + (cur[1] << 8) + (cur[2] << 16) + (cur[3] << 24); | ||
cur += 4; | ||
if (cur + len > end) { s->error = 2; } | ||
s->dict_run = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
File this under "how did this ever work?". What's the story here - how did RLE decoding work if we were starting 4 bytes off?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
File this under "how did this ever work?". What's the story here - how did RLE decoding work if we were starting 4 bytes off?
I'm guessing you never ran into RLE encoded bool data before 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some light summer reading 😉 Near as I can tell booleans will only be encoded with RLE with V2 writers (although it seems arrow-rs might allow RLE w/ V1, but not by default). So it likely is true that this code has never been exercised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the encoder has RLE boolean support but it's disabled. If I enable it, it writes RLE encoded boolean data, but lacks the 4 bytes of length required by the Parquet spec. So I guess that answers the question of how did this ever work.
Once this PR and #13751 are merged I'll submit a PR to enable RLE encoding for booleans when writing V2 files.
Currently running this through the spark plugin integration tests. Should be done in ~20 minutes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests pass. LGTM.
/ok to test |
/ok to test |
/merge |
This PR replaces the `def_lvl_bytes` and `rep_lvl_bytes` fields of the `gpu::PageInfo` struct with an array indexed by `gpu::level_type` (as is done with the `lvl_decode_buf` array). This allows for some streamlining in `InitLevelSection()`, removing some redundant code and improving readability. See this [comment](#13707 (comment)) for context. Authors: - Ed Seidl (https://github.com/etseidl) - Nghia Truong (https://github.com/ttnghia) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) URL: #13775
While working on #13707 it was noticed that RLE encoding of booleans had been implemented and then disabled (see [this comment](#13707 (comment)) for details). This PR re-enables RLE encoding for booleans, but only when V2 headers are being used. Part of #13501. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Bradley Dice (https://github.com/bdice) - Vukasin Milovanovic (https://github.com/vuule) URL: #13886
Description
The current parquet reader assumes that repetition or definition level data with a bit length of 0 will have no data encoded in the header. In the case of V2 headers, this assumption is false. This PR checks the V2 page header data to see if level data needs to be accounted for. Also fixes an error that was present in the RLE data decoder where the encoded length of the RLE data was not skipped properly.
Fixes #13655
Checklist