Write string data directly to column_buffer in Parquet reader #13302

Merged · 193 commits · Jun 23, 2023

Conversation

@etseidl (Contributor) commented May 5, 2023

Description

The current Parquet reader decodes string data into a list of {ptr, length} tuples, which are then used in a gather step by make_strings_column. This gather step can be time-consuming, especially when there are many string columns. This PR changes the decode step to write char and offset data directly to the column_buffer, which can then be used as-is, bypassing the gather step.
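
To illustrate the difference, here is a minimal host-side sketch (not the actual libcudf kernels; all names here are hypothetical) contrasting the gather-based path with writing chars and offsets directly into the output buffers:

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Old path: decode produces one {ptr, length} view per row, and a separate
// gather pass then copies every string into the final chars/offsets buffers.
struct str_view {
  char const* ptr;
  std::size_t len;
};

void gather_strings(std::vector<str_view> const& views,
                    std::vector<char>& chars,
                    std::vector<int>& offsets)
{
  offsets.assign(1, 0);
  for (auto const& v : views) {
    chars.insert(chars.end(), v.ptr, v.ptr + v.len);
    offsets.push_back(static_cast<int>(chars.size()));
  }
}

// New path: once string sizes are known up front (the job of the new
// size-computation kernel), each string can be written at its final
// position during decode, so the gather pass disappears.
void decode_direct(std::vector<std::string> const& rows,
                   std::vector<char>& chars,
                   std::vector<int>& offsets)
{
  std::size_t total = 0;
  for (auto const& s : rows) { total += s.size(); }  // size pre-pass
  chars.resize(total);
  offsets.assign(1, 0);
  std::size_t pos = 0;
  for (auto const& s : rows) {
    std::memcpy(chars.data() + pos, s.data(), s.size());
    pos += s.size();
    offsets.push_back(static_cast<int>(pos));
  }
}
```

The second path trades a pre-pass that computes sizes for the elimination of a full extra copy of the character data, which matches the kernel breakdown described below.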

The image below compares the new approach to the old. The green arc at the top (82ms) is gpuDecodePageData, and the red arc (252ms) is the time spent in make_strings_column. The green arc below (25ms) is gpuDecodePageData, the amber arc (22ms) is a new kernel that computes string sizes for each page, and the magenta arc (106ms) is the kernel that decodes string columns.
[profile comparison image: flat_edited]

NVbench shows a good speedup for strings as well. There is a jump in time for the INTEGRAL benchmark, but little to no change for the other data types. The INTEGRAL time seems to be affected by extra time spent in malloc allocating host memory for a hostdevice_vector; this malloc always occurs, but for some reason it takes much longer to return on this branch.

These numbers compare against @nvdbaranec's branch for #13203.

|  data_type  |      io       |  cardinality  |  run_length  |   Ref Time |   Cmp Time |        Diff |   %Diff |  
|-------------|---------------|---------------|--------------|------------|------------|-------------|---------| 
|  INTEGRAL   | DEVICE_BUFFER |       0       |      1       |  14.288 ms |  14.729 ms |  440.423 us |   3.08% |   
|  INTEGRAL   | DEVICE_BUFFER |     1000      |      1       |  13.397 ms |  13.997 ms |  600.596 us |   4.48% |   
|  INTEGRAL   | DEVICE_BUFFER |       0       |      32      |  11.831 ms |  12.354 ms |  522.485 us |   4.42% |   
|  INTEGRAL   | DEVICE_BUFFER |     1000      |      32      |  11.335 ms |  11.854 ms |  518.791 us |   4.58% |   
|    FLOAT    | DEVICE_BUFFER |       0       |      1       |   8.681 ms |   8.715 ms |   34.846 us |   0.40% |   
|    FLOAT    | DEVICE_BUFFER |     1000      |      1       |   8.473 ms |   8.472 ms |   -0.680 us |  -0.01% |   
|    FLOAT    | DEVICE_BUFFER |       0       |      32      |   7.217 ms |   7.192 ms |  -25.311 us |  -0.35% |   
|    FLOAT    | DEVICE_BUFFER |     1000      |      32      |   7.425 ms |   7.422 ms |   -3.162 us |  -0.04% |   
|   STRING    | DEVICE_BUFFER |       0       |      1       |  50.079 ms |  42.566 ms |-7513.004 us | -15.00% |   
|   STRING    | DEVICE_BUFFER |     1000      |      1       |  16.813 ms |  14.989 ms |-1823.660 us | -10.85% |   
|   STRING    | DEVICE_BUFFER |       0       |      32      |  49.875 ms |  42.443 ms |-7432.718 us | -14.90% |   
|   STRING    | DEVICE_BUFFER |     1000      |      32      |  15.312 ms |  13.953 ms |-1358.910 us |  -8.87% |   
|    LIST     | DEVICE_BUFFER |       0       |      1       |  80.303 ms |  80.688 ms |  385.916 us |   0.48% |   
|    LIST     | DEVICE_BUFFER |     1000      |      1       |  71.921 ms |  72.356 ms |  435.153 us |   0.61% |   
|    LIST     | DEVICE_BUFFER |       0       |      32      |  61.658 ms |  62.129 ms |  471.022 us |   0.76% |   
|    LIST     | DEVICE_BUFFER |     1000      |      32      |  63.086 ms |  63.371 ms |  285.608 us |   0.45% |   
|   STRUCT    | DEVICE_BUFFER |       0       |      1       |  66.272 ms |  61.142 ms |-5130.639 us |  -7.74% |   
|   STRUCT    | DEVICE_BUFFER |     1000      |      1       |  40.217 ms |  39.328 ms | -888.781 us |  -2.21% |   
|   STRUCT    | DEVICE_BUFFER |       0       |      32      |  63.660 ms |  58.837 ms |-4822.647 us |  -7.58% |   
|   STRUCT    | DEVICE_BUFFER |     1000      |      32      |  38.080 ms |  37.104 ms | -976.133 us |  -2.56% | 

May address #13024

Depends on #13203

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

nvdbaranec and others added 30 commits (April 23, 2023 17:49). Truncated commit messages as shown by GitHub:

  • "…, it was only 1 warp wide. Now it is block-wide. Only integrated into the gpuComputePageSizes() kernel. gpuDecodePages() will be a followup PR."
  • "…al with a performance issue introduced in gpuDecodePageData by previously changing them to be pointers instead of hardcoded arrays."
@vuule (Contributor) commented Jun 22, 2023:

/ok to test

@ttnghia (Contributor) commented Jun 23, 2023:

/ok to test

@@ -663,38 +663,19 @@ __global__ void __launch_bounds__(decode_block_size) gpuDecodeStringPageData(
page_state_buffers_s* const sb = &state_buffers;
int const page_idx = blockIdx.x;
int const t = threadIdx.x;
[[maybe_unused]] null_count_back_copier _{s, t};
Contributor comment:

How does this avoid the race condition when two separate kernels visit the same page? Won't one of them erroneously zero the page out that another may have written a valid value to?

@etseidl (Contributor, author) replied:

Only one invocation should make it past the filter. That one will zero out the null count and then the back copier will copy it back to the page. @vuule added the logic to make the back copy a no-op if the setup returns early.
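
For readers following along, here is a minimal CUDA sketch of the RAII back-copy idiom being discussed. The stand-in structs are illustrative, not cuDF's actual definitions, and the guard only loosely mirrors the real null_count_back_copier:

```cuda
// Illustrative stand-ins for the real cuDF page-decode structures.
struct nesting_info_t { int null_count; };
struct page_info_t    { int num_nulls; };
struct page_state_s {
  page_info_t page;
  nesting_info_t* nesting_info;  // left null when setup filters this block out
};

// RAII guard: on every return path, thread 0 copies the locally accumulated
// null count back to the page -- unless setup bailed out early, in which case
// nesting_info is still null and the back copy is a no-op. This is why two
// kernels visiting the same page cannot clobber each other's null count.
struct null_count_back_copier {
  page_state_s* s;
  int t;
  __device__ ~null_count_back_copier()
  {
    if (s->nesting_info != nullptr && t == 0) {
      s->page.num_nulls = s->nesting_info[0].null_count;
    }
  }
};

__global__ void decode_kernel(page_state_s* s)
{
  int const t = threadIdx.x;
  [[maybe_unused]] null_count_back_copier _{s, t};  // destructor runs on all exits
  if (s->nesting_info == nullptr) { return; }       // filtered out: back copy no-ops
  // ... decoding work that updates s->nesting_info[0].null_count ...
}
```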

Contributor reply:

Ah, I see. Checking to see if the nesting_info pointer is null.

@nvdbaranec (Contributor) left a review:

Ship it.

@vuule (Contributor) commented Jun 23, 2023:

/merge

@vuule (Contributor) commented Jun 23, 2023:

oops, missing a cmake review
Edit: asked for one, will merge as soon as we get that approval.

@vyasr (Contributor) left a review:

CMake approval

Labels: 4 - Needs Review · CMake · cuIO · improvement · libcudf · non-breaking · Performance
7 participants