Fix reading of RLE encoded boolean data from parquet files with V2 page headers #13707

etseidl · 2023-07-17T17:35:05Z

Description

The current parquet reader assumes that repetition or definition level data with a bit length of 0 will have no data encoded in the header. In the case of V2 headers, this assumption is false. This PR checks the V2 page header data to see if level data needs to be accounted for. Also fixes an error that was present in the RLE data decoder where the encoded length of the RLE data was not skipped properly.
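As a rough illustration of the first point, here is a minimal sketch (not the cuDF code) of locating the level and value sections of a V2 data page from the byte lengths stored in the page header, rather than inferring them from the level bit width. The struct and function names are hypothetical; the only assumption taken from the Parquet format is that a V2 data page header records the repetition and definition level byte lengths, and that the level data precedes the values in that order.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the two byte-length fields of a V2 data page header.
struct v2_level_lengths {
  uint32_t rep_bytes;  // bytes of repetition level data
  uint32_t def_bytes;  // bytes of definition level data
};

// Offsets (from the start of the page data) of each section.
struct page_sections {
  size_t rep_begin;
  size_t def_begin;
  size_t values_begin;
};

// Uses the header byte lengths, not the level bit width, to decide how much
// level data must be skipped. Returns false if the lengths exceed the page.
inline bool locate_sections(v2_level_lengths hdr, size_t page_size, page_sections& out)
{
  size_t const total = static_cast<size_t>(hdr.rep_bytes) + hdr.def_bytes;
  if (total > page_size) { return false; }
  out.rep_begin    = 0;
  out.def_begin    = hdr.rep_bytes;
  out.values_begin = total;
  return true;
}
```

With this layout, a page whose definition level bit width is 0 but whose header reports 2 bytes of definition level data still has its values starting at offset 2.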

Fixes #13655

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

rapids-bot · 2023-07-17T17:35:10Z

Pull requests from external contributors require approval from a rapidsai organization member with write permissions or greater before CI can begin.

GregoryKimball · 2023-07-17T21:51:59Z

/ok to test

hyperbolic2346 · 2023-07-19T01:33:12Z

/ok to test

vuule

Looks good, appreciate that the code duplication was avoided.
A few mildly confused comments below :)

cpp/src/io/parquet/page_hdr.cu

cpp/src/io/parquet/page_decode.cuh

vuule · 2023-07-21T19:29:57Z

/ok to test

vuule · 2023-07-21T20:06:59Z

/ok to test

hyperbolic2346 · 2023-07-25T14:47:33Z

/ok to test

nvdbaranec · 2023-07-25T20:13:49Z

cpp/src/io/parquet/page_decode.cuh

+    } else {
+      init_rle(cur, cur + len);
+    }


Is it always going to be safe to do this? That is, is it always the case that for V2 headers, there will be a valid varint, regardless of whether level_bits is 0 or not?

init_rle is only called if the number of bytes for the level is non-zero (len != 0), so at this point there should be some bytes to read. The problem file actually encodes only the RLE length, not the RLE value (which is then assumed to be 0). That's why init_rle tests for cur < end.

FWIW I'm planning to refactor the def/rep_lvl_bytes into an array to match other bits of the PageInfo struct. Then this bit of logic gets a little clearer.
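To make the case described above concrete, here is a minimal host-side sketch, assuming the standard Parquet RLE/bit-packed hybrid layout (a ULEB128 run header whose low bit selects a literal run, followed for repeated runs by the value in ceil(bits/8) bytes). `read_varint`, `rle_run`, and `first_run` are hypothetical names and this is not the cuDF `init_rle` implementation; it only shows how a `cur < end` check lets a missing value default to 0.

```cpp
#include <cstdint>

// Hypothetical helper: reads a ULEB128 varint from [cur, end) and advances cur.
// Stops early (returning what was read so far) if the buffer runs out.
inline uint32_t read_varint(const uint8_t*& cur, const uint8_t* end)
{
  uint32_t value = 0;
  int shift      = 0;
  while (cur < end && shift < 32) {
    uint8_t const byte = *cur++;
    value |= static_cast<uint32_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) { break; }
    shift += 7;
  }
  return value;
}

struct rle_run {
  uint32_t count;   // number of values in the run
  uint32_t value;   // repeated value; stays 0 if no value bytes are present
  bool is_literal;  // true for a bit-packed (literal) run
};

// Decodes the header of the first run in [cur, end).
inline rle_run first_run(const uint8_t* cur, const uint8_t* end, int level_bits)
{
  uint32_t const header = read_varint(cur, end);
  rle_run run{};
  run.is_literal = (header & 1) != 0;
  if (run.is_literal) {
    run.count = (header >> 1) * 8;  // literal runs come in groups of 8 values
    return run;
  }
  run.count        = header >> 1;
  int const nbytes = (level_bits + 7) / 8;
  // Bounds check on every byte: a run with no value encoded yields value == 0.
  for (int i = 0; i < nbytes && cur < end; ++i) {
    run.value |= static_cast<uint32_t>(*cur++) << (8 * i);
  }
  return run;
}
```

A buffer holding only the single byte 0x10 decodes as a repeated run of 8 values with value 0, which matches the behavior described for the problem file.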

nvdbaranec · 2023-07-25T20:16:31Z

cpp/src/io/parquet/page_decode.cuh

+          // first 4 bytes are length of RLE data
+          int const len = (cur[0]) + (cur[1] << 8) + (cur[2] << 16) + (cur[3] << 24);
+          cur += 4;
+          if (cur + len > end) { s->error = 2; }
+          s->dict_run = 0;


File this under "how did this ever work?". What's the story here - how did RLE decoding work if we were starting 4 bytes off?

I'm guessing you never ran into RLE encoded bool data before 😉

Some light summer reading 😉 Near as I can tell booleans will only be encoded with RLE with V2 writers (although it seems arrow-rs might allow RLE w/ V1, but not by default). So it likely is true that this code has never been exercised.

Looks like the encoder has RLE boolean support but it's disabled. If I enable it, it writes RLE encoded boolean data, but lacks the 4 bytes of length required by the Parquet spec. So I guess that answers the question of how did this ever work.

Once this PR and #13751 are merged I'll submit a PR to enable RLE encoding for booleans when writing V2 files.
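For reference, the length-prefix handling discussed in this thread can be sketched on the host as follows. This is an illustrative helper under stated assumptions, not the code in `page_decode.cuh`: the name `read_rle_length` is hypothetical, and unlike the diff above it widens each byte to `uint32_t` before shifting and checks that 4 bytes are available before reading them.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: RLE-encoded boolean data starts with a 4-byte
// little-endian length. Reads that length, advances cur past it, and reports
// whether the declared payload fits in [cur, end).
// Note: cur is left unchanged only when fewer than 4 bytes are available.
inline bool read_rle_length(const uint8_t*& cur, const uint8_t* end, uint32_t& len)
{
  if (end - cur < 4) { return false; }
  len = static_cast<uint32_t>(cur[0]) | (static_cast<uint32_t>(cur[1]) << 8) |
        (static_cast<uint32_t>(cur[2]) << 16) | (static_cast<uint32_t>(cur[3]) << 24);
  cur += 4;  // the RLE runs begin after the length field
  return len <= static_cast<size_t>(end - cur);
}
```

Skipping these 4 bytes is the step the reader was missing; a writer that omits the prefix would produce data that only a reader with the same omission can decode.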

nvdbaranec · 2023-07-26T00:04:19Z

Currently running this through the spark plugin integration tests. Should be done in ~20 minutes.

nvdbaranec

Tests pass. LGTM.

vuule · 2023-07-26T16:18:39Z

/ok to test

python/cudf/cudf/tests/test_parquet.py

PointKernel · 2023-07-26T17:49:58Z

/ok to test

PointKernel · 2023-07-26T19:58:11Z

/merge

This PR replaces the `def_lvl_bytes` and `rep_lvl_bytes` fields of the `gpu::PageInfo` struct with an array indexed by `gpu::level_type` (as is done with the `lvl_decode_buf` array). This allows for some streamlining in `InitLevelSection()`, removing some redundant code and improving readability. See this [comment](#13707 (comment)) for context.

Authors:
- Ed Seidl (https://github.com/etseidl)
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Nghia Truong (https://github.com/ttnghia)

URL: #13775

While working on #13707 it was noticed that RLE encoding of booleans had been implemented and then disabled (see [this comment](#13707 (comment)) for details). This PR re-enables RLE encoding for booleans, but only when V2 headers are being used. Part of #13501.

Authors:
- Ed Seidl (https://github.com/etseidl)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #13886

etseidl and others added 6 commits July 13, 2023 09:49

initial pass

0f51ef1

Merge branch 'rapidsai:branch-23.08' into feature/fix_rle_bool

20d9d25

Merge branch 'rapidsai:branch-23.08' into feature/fix_rle_bool

05171bd

reduce some redundancy

c02a4b9

fix for rle literal run with no value encoded

39a78b1

add python read test

a27ec2a

etseidl requested review from a team as code owners July 17, 2023 17:35

etseidl requested review from vyasr, brandon-b-miller and ttnghia July 17, 2023 17:35

github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Jul 17, 2023

etseidl and others added 2 commits July 17, 2023 10:35

Merge branch 'branch-23.08' into feature/fix_rle_bool

09b1f30

make encoding const

2b2a8f7

etseidl and others added 4 commits July 17, 2023 15:23

format

694dac2

Merge branch 'rapidsai:branch-23.08' into feature/fix_rle_bool

40d8ffb

Merge branch 'rapidsai:branch-23.08' into feature/fix_rle_bool

7e7c180

Merge branch 'rapidsai:branch-23.08' into feature/fix_rle_bool

c87d759

hyperbolic2346 added bug Something isn't working cuIO cuIO issue non-breaking Non-breaking change labels Jul 18, 2023

etseidl and others added 2 commits July 18, 2023 14:12

test error value before calling init_rle

24cd888

Merge branch 'branch-23.08' into feature/fix_rle_bool

c6d63e8

vuule reviewed Jul 21, 2023

cpp/src/io/parquet/page_hdr.cu (outdated)

cpp/src/io/parquet/page_decode.cuh (outdated)

etseidl and others added 2 commits July 20, 2023 18:07

Merge branch 'rapidsai:branch-23.08' into feature/fix_rle_bool

68313b6

implement suggestion from review

bb7a20d

etseidl added 2 commits July 21, 2023 12:36

redo V1 RLE a bit

944adbb

treat flags like a mask

4bf1b70

vuule approved these changes Jul 21, 2023

GregoryKimball requested a review from nvdbaranec July 24, 2023 20:46

Merge branch 'branch-23.08' into feature/fix_rle_bool

f637051

nvdbaranec reviewed Jul 25, 2023

Merge branch 'rapidsai:branch-23.08' into feature/fix_rle_bool

24cd8a4

nvdbaranec approved these changes Jul 26, 2023

etseidl added 2 commits July 25, 2023 18:11

Merge branch 'branch-23.08' into feature/fix_rle_bool

0323628

Merge branch 'branch-23.08' into feature/fix_rle_bool

94c8f73

GregoryKimball requested a review from galipremsagar July 26, 2023 17:14

galipremsagar reviewed Jul 26, 2023

python/cudf/cudf/tests/test_parquet.py (outdated)

Update python/cudf/cudf/tests/test_parquet.py

fcf9db0

galipremsagar approved these changes Jul 26, 2023

sameerz linked an issue Jul 26, 2023 that may be closed by this pull request

[BUG] Parquet with RLE encoded booleans loads corrupted data NVIDIA/spark-rapids#8630

Closed

PointKernel approved these changes Jul 26, 2023

rapids-bot bot merged commit 55894bf into rapidsai:branch-23.08 Jul 26, 2023
54 checks passed

etseidl deleted the feature/fix_rle_bool branch July 26, 2023 19:58

etseidl mentioned this pull request Jul 26, 2023

Refactor Parquet reader handling of V2 page header info #13775

Merged

3 tasks

GregoryKimball assigned etseidl Jul 31, 2023

etseidl mentioned this pull request Aug 15, 2023

Enable RLE boolean encoding for v2 Parquet files #13886

Merged

3 tasks
