Fix reading of RLE encoded boolean data from parquet files with V2 page headers #13707

Merged
merged 23 commits into rapidsai:branch-23.08 from feature/fix_rle_bool on Jul 26, 2023

Conversation

Contributor

@etseidl etseidl commented Jul 17, 2023

Description

The current Parquet reader assumes that repetition or definition level data with a bit width of 0 will have no level data encoded. For V2 page headers this assumption is false: the level data can still occupy bytes. This PR checks the V2 page header to see whether level data needs to be accounted for. It also fixes a bug in the decoder for RLE-encoded boolean data, where the 4-byte encoded length that precedes the RLE data was not skipped.

Fixes #13655
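
As an illustration of the first point, here is a minimal host-side sketch (not the libcudf code). It assumes the reader has already parsed the `repetition_levels_byte_length` and `definition_levels_byte_length` fields that the Parquet `DataPageHeaderV2` carries. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified stand-in for the two V2 header fields that matter here.
struct V2LevelLengths {
  int32_t repetition_levels_byte_length;
  int32_t definition_levels_byte_length;
};

// Offset of the value data from the start of the page body.
// The level sections are skipped by their declared byte lengths, so a level
// with a bit width of 0 is still accounted for if the header says it has bytes.
inline std::size_t values_offset_v2(V2LevelLengths const& hdr)
{
  assert(hdr.repetition_levels_byte_length >= 0);
  assert(hdr.definition_levels_byte_length >= 0);
  return static_cast<std::size_t>(hdr.repetition_levels_byte_length) +
         static_cast<std::size_t>(hdr.definition_levels_byte_length);
}
```

The point of the sketch is that the offset depends only on what the header declares, not on whether the level bit width happens to be 0.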

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested review from a team as code owners July 17, 2023 17:35
@rapids-bot

rapids-bot bot commented Jul 17, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write permissions or greater before CI can begin.

@github-actions github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) and Python (Affects Python cuDF API) labels on Jul 17, 2023
@GregoryKimball
Contributor

/ok to test

@hyperbolic2346 hyperbolic2346 added the bug (Something isn't working), cuIO (cuIO issue), and non-breaking (Non-breaking change) labels on Jul 18, 2023
@hyperbolic2346
Contributor

/ok to test

Contributor

@vuule vuule left a comment

Looks good, appreciate that the code duplication was avoided.
A few mildly confused comments below :)

cpp/src/io/parquet/page_hdr.cu (outdated, resolved)
cpp/src/io/parquet/page_decode.cuh (outdated, resolved)
@vuule
Contributor

vuule commented Jul 21, 2023

/ok to test

@vuule
Contributor

vuule commented Jul 21, 2023

/ok to test

@hyperbolic2346
Contributor

/ok to test

Comment on lines +912 to +914
} else {
  init_rle(cur, cur + len);
}
Contributor

Is it always going to be safe to do this? That is, is it always the case that for V2 headers, there will be a valid varint, regardless of whether level_bits is 0 or not?

Contributor Author

@etseidl etseidl Jul 25, 2023

init_rle is only called if the number of bytes for the level is non-zero (len != 0), so at this point there should be some bytes to read. The problem file actually encodes only the RLE run length, not the RLE value (which is then assumed to be 0). That's why init_rle tests for cur < end.

FWIW I'm planning to refactor the def/rep_lvl_bytes into an array to match other bits of the PageInfo struct. Then this bit of logic gets a little clearer.
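
To make the `cur < end` guard concrete, here is a simplified host-side sketch of reading the first run of an RLE/bit-packed hybrid stream. It is not the actual `init_rle`; the `FirstRun` struct and `read_first_run` function are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for this sketch.
struct FirstRun {
  uint32_t run_length;  // header >> 1 (repeat count for an RLE run)
  uint32_t value;       // repeated value; stays 0 if no value bytes are present
  bool is_rle;          // true when the low bit of the header is 0
};

inline FirstRun read_first_run(uint8_t const* cur, uint8_t const* end, int level_bits)
{
  assert(cur <= end);
  // The run header is a ULEB128 varint; never read past `end`.
  uint32_t header = 0;
  int shift       = 0;
  while (cur < end && shift < 32) {
    uint8_t const b = *cur++;
    header |= static_cast<uint32_t>(b & 0x7f) << shift;
    if ((b & 0x80) == 0) { break; }
    shift += 7;
  }
  FirstRun run{header >> 1, 0, (header & 1) == 0};
  if (run.is_rle) {
    // The repeated value occupies ceil(level_bits / 8) bytes, little-endian.
    // If the stream ends right after the header, the value is left as 0.
    int const value_bytes = (level_bits + 7) / 8;
    for (int i = 0; i < value_bytes && cur < end; ++i) {
      run.value |= static_cast<uint32_t>(*cur++) << (8 * i);
    }
  }
  return run;
}
```

With a buffer that holds only the header byte `0x14` (an RLE run of 10), the sketch returns a value of 0 instead of reading past the end, which mirrors the behaviour described for the problem file.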

Comment on lines +1271 to +1275
// first 4 bytes are length of RLE data
int const len = (cur[0]) + (cur[1] << 8) + (cur[2] << 16) + (cur[3] << 24);
cur += 4;
if (cur + len > end) { s->error = 2; }
s->dict_run = 0;
Contributor

File this under "how did this ever work?". What's the story here - how did RLE decoding work if we were starting 4 bytes off?

Contributor Author

> File this under "how did this ever work?". What's the story here - how did RLE decoding work if we were starting 4 bytes off?

I'm guessing you never ran into RLE encoded bool data before 😉

Contributor Author

Some light summer reading 😉 Near as I can tell booleans will only be encoded with RLE with V2 writers (although it seems arrow-rs might allow RLE w/ V1, but not by default). So it likely is true that this code has never been exercised.

Contributor Author

Looks like the encoder has RLE boolean support but it's disabled. If I enable it, it writes RLE encoded boolean data, but lacks the 4 bytes of length required by the Parquet spec. So I guess that answers the question of how did this ever work.

Once this PR and #13751 are merged I'll submit a PR to enable RLE encoding for booleans when writing V2 files.
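
For readers following along, the 4-byte length prefix discussed in this thread can be sketched on the host side as below. This is an illustration under stated assumptions, not the kernel code: `RlePayload` and `rle_payload` are hypothetical names, and the error handling is reduced to a single flag.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type: the byte range holding the RLE data itself.
struct RlePayload {
  uint8_t const* begin;
  uint8_t const* end;
  bool ok;  // false if the prefix is missing or claims more bytes than exist
};

inline RlePayload rle_payload(uint8_t const* cur, uint8_t const* end)
{
  assert(cur <= end);
  // The first 4 bytes are the little-endian length of the RLE data.
  if (end - cur < 4) { return {cur, cur, false}; }
  uint32_t const len = static_cast<uint32_t>(cur[0]) | (static_cast<uint32_t>(cur[1]) << 8) |
                       (static_cast<uint32_t>(cur[2]) << 16) |
                       (static_cast<uint32_t>(cur[3]) << 24);
  cur += 4;  // skip the prefix so decoding starts at the first run header
  if (len > static_cast<std::size_t>(end - cur)) { return {cur, end, false}; }
  return {cur, cur + len, true};
}
```

Decoding then starts at `begin`, which is the step that was missing before this PR.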

@nvdbaranec
Contributor

Currently running this through the spark plugin integration tests. Should be done in ~20 minutes.

Contributor

@nvdbaranec nvdbaranec left a comment

Tests pass. LGTM.

@vuule
Contributor

vuule commented Jul 26, 2023

/ok to test

@PointKernel
Member

/ok to test

@PointKernel
Member

/merge

@rapids-bot rapids-bot bot merged commit 55894bf into rapidsai:branch-23.08 Jul 26, 2023
54 checks passed
@etseidl etseidl deleted the feature/fix_rle_bool branch July 26, 2023 19:58
rapids-bot bot pushed a commit that referenced this pull request Aug 15, 2023
This PR replaces the `def_lvl_bytes` and `rep_lvl_bytes` fields of the `gpu::PageInfo` struct with an array indexed by `gpu::level_type` (as is done with the `lvl_decode_buf` array). This allows for some streamlining in `InitLevelSection()`, removing some redundant code and improving readability.

See this comment on #13707 for context.
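
A rough sketch of the shape of this refactor, assuming a simplified struct: the enumerator names and the `lvl_bytes` field name below are illustrative assumptions, not taken from the libcudf sources.

```cpp
#include <cassert>
#include <cstdint>

// Assumed enumerators; only the idea of indexing by level type comes from the PR text.
enum level_type { DEFINITION = 0, REPETITION, NUM_LEVEL_TYPES };

// Hypothetical, cut-down page descriptor with one byte count per level type.
struct PageInfoSketch {
  int32_t lvl_bytes[NUM_LEVEL_TYPES];
};

// With an array, both level types go through the same loop instead of two
// near-identical code paths for separate def/rep fields.
inline int32_t total_level_bytes(PageInfoSketch const& page)
{
  int32_t total = 0;
  for (int lvl = 0; lvl < NUM_LEVEL_TYPES; ++lvl) {
    assert(page.lvl_bytes[lvl] >= 0);
    total += page.lvl_bytes[lvl];
  }
  return total;
}
```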

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13775
rapids-bot bot pushed a commit that referenced this pull request Aug 17, 2023
While working on #13707 it was noticed that RLE encoding of booleans had been implemented and then disabled (see the discussion in #13707 for details). This PR re-enables RLE encoding for booleans, but only when V2 headers are being used.

Part of #13501.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #13886
Labels: bug (Something isn't working), cuIO (cuIO issue), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change), Python (Affects Python cuDF API)