Write string data directly to column_buffer in Parquet reader #13302

Merged · 193 commits · Jun 23, 2023

Conversation

@etseidl (Contributor) commented May 5, 2023

Description

The current Parquet reader decodes string data into a list of {ptr, length} tuples, which are then used in a gather step by make_strings_column. This gather step can be time-consuming, especially when there are many string columns. This PR changes the decode step to write char and offset data directly to the column_buffer, which can then be used as-is, bypassing the gather step.
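
To illustrate the difference, here is a minimal host-side sketch (not the actual libcudf kernels; all names here are hypothetical) contrasting the gather-based path with writing chars and offsets directly into the output buffers:

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Old path: decode produces one {ptr, length} view per row, and a separate
// gather pass then copies every string into the final chars/offsets buffers.
struct str_view {
  char const* ptr;
  std::size_t len;
};

void gather_strings(std::vector<str_view> const& views,
                    std::vector<char>& chars,
                    std::vector<int>& offsets)
{
  offsets.assign(1, 0);
  for (auto const& v : views) {
    chars.insert(chars.end(), v.ptr, v.ptr + v.len);
    offsets.push_back(static_cast<int>(chars.size()));
  }
}

// New path: once string sizes are known up front (the job of the new
// size-computation kernel), each string can be written at its final
// position during decode, so the gather pass disappears.
void decode_direct(std::vector<std::string> const& rows,
                   std::vector<char>& chars,
                   std::vector<int>& offsets)
{
  std::size_t total = 0;
  for (auto const& s : rows) { total += s.size(); }  // size pre-pass
  chars.resize(total);
  offsets.assign(1, 0);
  std::size_t pos = 0;
  for (auto const& s : rows) {
    std::memcpy(chars.data() + pos, s.data(), s.size());
    pos += s.size();
    offsets.push_back(static_cast<int>(pos));
  }
}
```

The second path trades a pre-pass that computes sizes for the elimination of a full extra copy of the character data, which matches the kernel breakdown described below.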

The image below compares the new approach to the old. The green arc at the top (82ms) is gpuDecodePageData, and the red arc (252ms) is the time spent in make_strings_column. The green arc below (25ms) is gpuDecodePageData, the amber arc (22ms) is a new kernel that computes string sizes for each page, and the magenta arc (106ms) is the kernel that decodes string columns.
[profile comparison image: flat_edited]

NVbench shows a good speedup for strings as well. There is a jump in time for the INTEGRAL benchmark, but little to no change for the other data types. The INTEGRAL time seems to be affected by extra time spent in malloc allocating host memory for a hostdevice_vector; this malloc always occurs, but for some reason it takes much longer to return on this branch.

These numbers compare against @nvdbaranec's branch for #13203.

|  data_type  |      io       |  cardinality  |  run_length  |   Ref Time |   Cmp Time |        Diff |   %Diff |  
|-------------|---------------|---------------|--------------|------------|------------|-------------|---------| 
|  INTEGRAL   | DEVICE_BUFFER |       0       |      1       |  14.288 ms |  14.729 ms |  440.423 us |   3.08% |   
|  INTEGRAL   | DEVICE_BUFFER |     1000      |      1       |  13.397 ms |  13.997 ms |  600.596 us |   4.48% |   
|  INTEGRAL   | DEVICE_BUFFER |       0       |      32      |  11.831 ms |  12.354 ms |  522.485 us |   4.42% |   
|  INTEGRAL   | DEVICE_BUFFER |     1000      |      32      |  11.335 ms |  11.854 ms |  518.791 us |   4.58% |   
|    FLOAT    | DEVICE_BUFFER |       0       |      1       |   8.681 ms |   8.715 ms |   34.846 us |   0.40% |   
|    FLOAT    | DEVICE_BUFFER |     1000      |      1       |   8.473 ms |   8.472 ms |   -0.680 us |  -0.01% |   
|    FLOAT    | DEVICE_BUFFER |       0       |      32      |   7.217 ms |   7.192 ms |  -25.311 us |  -0.35% |   
|    FLOAT    | DEVICE_BUFFER |     1000      |      32      |   7.425 ms |   7.422 ms |   -3.162 us |  -0.04% |   
|   STRING    | DEVICE_BUFFER |       0       |      1       |  50.079 ms |  42.566 ms |-7513.004 us | -15.00% |   
|   STRING    | DEVICE_BUFFER |     1000      |      1       |  16.813 ms |  14.989 ms |-1823.660 us | -10.85% |   
|   STRING    | DEVICE_BUFFER |       0       |      32      |  49.875 ms |  42.443 ms |-7432.718 us | -14.90% |   
|   STRING    | DEVICE_BUFFER |     1000      |      32      |  15.312 ms |  13.953 ms |-1358.910 us |  -8.87% |   
|    LIST     | DEVICE_BUFFER |       0       |      1       |  80.303 ms |  80.688 ms |  385.916 us |   0.48% |   
|    LIST     | DEVICE_BUFFER |     1000      |      1       |  71.921 ms |  72.356 ms |  435.153 us |   0.61% |   
|    LIST     | DEVICE_BUFFER |       0       |      32      |  61.658 ms |  62.129 ms |  471.022 us |   0.76% |   
|    LIST     | DEVICE_BUFFER |     1000      |      32      |  63.086 ms |  63.371 ms |  285.608 us |   0.45% |   
|   STRUCT    | DEVICE_BUFFER |       0       |      1       |  66.272 ms |  61.142 ms |-5130.639 us |  -7.74% |   
|   STRUCT    | DEVICE_BUFFER |     1000      |      1       |  40.217 ms |  39.328 ms | -888.781 us |  -2.21% |   
|   STRUCT    | DEVICE_BUFFER |       0       |      32      |  63.660 ms |  58.837 ms |-4822.647 us |  -7.58% |   
|   STRUCT    | DEVICE_BUFFER |     1000      |      32      |  38.080 ms |  37.104 ms | -976.133 us |  -2.56% | 

May address #13024

Depends on #13203

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

nvdbaranec and others added 30 commits (April 23, 2023 17:49). Truncated commit messages as shown by GitHub:

  • "…, it was only 1 warp wide. Now it is block-wide. Only integrated into the gpuComputePageSizes() kernel. gpuDecodePages() will be a followup PR."
  • "…al with a performance issue introduced in gpuDecodePageData by previously changing them to be pointers instead of hardcoded arrays."
@vuule (Contributor) commented Jun 22, 2023:

/ok to test

@ttnghia (Contributor) commented Jun 23, 2023:

/ok to test

@@ -663,38 +663,19 @@ __global__ void __launch_bounds__(decode_block_size) gpuDecodeStringPageData(
page_state_buffers_s* const sb = &state_buffers;
int const page_idx = blockIdx.x;
int const t = threadIdx.x;
[[maybe_unused]] null_count_back_copier _{s, t};
Contributor comment:

How does this avoid the race condition when two separate kernels visit the same page? Won't one of them erroneously zero the page out that another may have written a valid value to?

@etseidl (Contributor, author) replied:

Only one invocation should make it past the filter. That one will zero out the null count and then the back copier will copy it back to the page. @vuule added the logic to make the back copy a no-op if the setup returns early.
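
For readers following along, here is a minimal CUDA sketch of the RAII back-copy idiom being discussed. The stand-in structs are illustrative, not cuDF's actual definitions, and the guard only loosely mirrors the real null_count_back_copier:

```cuda
// Illustrative stand-ins for the real cuDF page-decode structures.
struct nesting_info_t { int null_count; };
struct page_info_t    { int num_nulls; };
struct page_state_s {
  page_info_t page;
  nesting_info_t* nesting_info;  // left null when setup filters this block out
};

// RAII guard: on every return path, thread 0 copies the locally accumulated
// null count back to the page -- unless setup bailed out early, in which case
// nesting_info is still null and the back copy is a no-op. This is why two
// kernels visiting the same page cannot clobber each other's null count.
struct null_count_back_copier {
  page_state_s* s;
  int t;
  __device__ ~null_count_back_copier()
  {
    if (s->nesting_info != nullptr && t == 0) {
      s->page.num_nulls = s->nesting_info[0].null_count;
    }
  }
};

__global__ void decode_kernel(page_state_s* s)
{
  int const t = threadIdx.x;
  [[maybe_unused]] null_count_back_copier _{s, t};  // destructor runs on all exits
  if (s->nesting_info == nullptr) { return; }       // filtered out: back copy no-ops
  // ... decoding work that updates s->nesting_info[0].null_count ...
}
```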

Contributor reply:

Ah, I see. Checking to see if the nesting_info pointer is null.

@nvdbaranec (Contributor) left a review:

Ship it.

@vuule (Contributor) commented Jun 23, 2023:

/merge

@vuule (Contributor) commented Jun 23, 2023:

oops, missing a cmake review
Edit: asked for one, will merge as soon as we get that approval.

@vyasr (Contributor) left a review:

CMake approval

Labels: 4 - Needs Review · CMake · cuIO · improvement · libcudf · non-breaking · Performance
7 participants