[WIP] Add support for DELTA_BINARY_PACKED and DELTA_BYTE_ARRAY encodings to Parquet reader #12948

Closed
wants to merge 164 commits into from

Conversation

etseidl
Contributor

@etseidl etseidl commented Mar 14, 2023

Description

Some Parquet writers will fall back to DELTA_BINARY_PACKED or DELTA_BYTE_ARRAY encoding when dictionary encoding cannot be used. This PR is a first attempt at adding support for these encodings to the Parquet reader. A description of these encodings can be found starting here.
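For readers unfamiliar with DELTA_BINARY_PACKED, the core idea can be sketched in a few lines. This is only an illustration in Python, not code from this PR: it assumes the block header has already been parsed and the miniblocks have already been bit-unpacked, and the function name `decode_delta_block` and its parameters are hypothetical. Each stored number is `delta - min_delta`, so reconstruction is a running sum starting at the first value.

```python
from itertools import accumulate


def decode_delta_block(first_value, min_delta, packed_deltas):
    """Rebuild values from one block of already bit-unpacked deltas.

    Hypothetical helper for illustration only. `packed_deltas` holds
    (delta - min_delta) for every value after the first, which keeps the
    stored numbers non-negative so they can be bit-packed.
    """
    # Undo the min_delta offset to recover the true deltas.
    deltas = (d + min_delta for d in packed_deltas)
    # A running sum seeded with the first value yields the original values.
    return list(accumulate(deltas, initial=first_value))


values = decode_delta_block(first_value=7, min_delta=-2, packed_deltas=[0, 2, 5, 1])
print(values)
```

Because the reconstruction is a prefix sum, it maps naturally onto a parallel scan once the deltas are unpacked.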

I'm mostly looking for feedback on my approach right now, in particular on the final decoding of strings in DELTA_BYTE_ARRAY. Each string is encoded as a prefix length (the number of leading bytes it shares with the preceding string), a suffix length, and then the suffix bytes. To reconstruct string_i, you need the first prefix_length(i) bytes of string_(i-1), which at first blush seems to be a serial operation. I've used a few cheats to try to be a bit more parallel, but I'm open to suggestions for making it even more so. The logic for this is in the StringScan function starting at line 2105 of page_data.cu.
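To make the dependency concrete, here is a serial reference sketch of the string reconstruction in Python. It is not the StringScan kernel from this PR: it assumes the prefix and suffix lengths have already been decoded into plain lists, and the function name `decode_delta_byte_array` is hypothetical.

```python
def decode_delta_byte_array(prefix_lengths, suffix_lengths, suffix_data):
    """Serial reference decoder (hypothetical helper, illustration only).

    prefix_lengths[i] is the number of leading bytes string i shares with
    string i-1; suffix_data is all suffixes concatenated back to back.
    """
    strings = []
    previous = b""
    offset = 0  # current read position within suffix_data
    for prefix_len, suffix_len in zip(prefix_lengths, suffix_lengths):
        suffix = suffix_data[offset:offset + suffix_len]
        offset += suffix_len
        # The serial dependency: string i needs the bytes of string i-1.
        current = previous[:prefix_len] + suffix
        strings.append(current)
        previous = current
    return strings


decoded = decode_delta_byte_array([0, 5, 4], [5, 1, 1], b"applesy")
print(decoded)
```

In this sketch the suffix offsets are just an exclusive prefix sum of the suffix lengths, so that part could be computed in parallel; only the prefix copy carries a dependency on the previous string.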

I'm also wondering if it makes more sense to use all 128 threads to do decoding, rather than the current approach of using one warp for rep/def level decoding, one or two warps for delta decoding, and one warp outputting values (which mirrors how the current decoder works).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

etseidl and others added 30 commits February 9, 2023 14:07
GPUtester and others added 16 commits May 24, 2023 19:53
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Bump up JNI version to 23.08.0-SNAPSHOT in branch-23.08

Authors:
  - Peixin (https://github.com/pxLi)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Jason Lowe (https://github.com/jlowe)

URL: rapidsai#13401
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
Forward-merge branch-23.06 to branch-23.08
)

Cleans up source files for nvtext and io-text pytests. The pytests are placed into separate files: `test_io_text.py` for the io-text pytests and `test_nvtext.py` for the nvtext pytests. Also removed the `python/cudf/cudf/tests/text` folder which contained 2 empty `.py` files.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: rapidsai#13435
This PR attempts to allow using newer versions of scikit-build again.

cf. rapidsai#13188

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Lawrence Mitchell (https://github.com/wence-)

URL: rapidsai#13424
closes rapidsai#13412
Remove weak references of cleaned resources when a resource is cleaned.
The cleaned objects are never leaked, so it's safe to remove the weak references.
This is to reduce the memory usage.

Authors:
  - Chong Gao (https://github.com/res-life)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Robert (Bobby) Evans (https://github.com/revans2)
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#13378
@github-actions github-actions bot added ci CMake CMake build issue Java Affects Java cuDF API. labels May 30, 2023
GPUtester and others added 4 commits May 30, 2023 14:09
Forward-merge branch-23.06 to branch-23.08
Depends on: rapidsai/rapids-cmake#393

Once the above PR is merged, this updated logic ensures that cudf places the custom versions of cccl packages in correct places, and can find them once installed.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#13235
@etseidl
Contributor Author

etseidl commented Jun 2, 2023

Closing for now. Will resubmit as part of #13501

@etseidl etseidl closed this Jun 2, 2023
@etseidl etseidl deleted the feature/delta_binary branch June 26, 2023 15:42
@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuIO Reviewer labels Feb 23, 2024
Labels
4 - Needs Review Waiting for reviewer to review or respond CMake CMake build issue cuIO cuIO issue feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.