
[BUG] Prevent silent fallback to uncompressed when writing parquet files with ZSTD compression #15501

Closed
GregoryKimball opened this issue Apr 10, 2024 · 2 comments · Fixed by #15570
Labels
bug (Something isn't working) · cuIO (cuIO issue) · libcudf (Affects libcudf (C++/CUDA) code) · Spark (Functionality that helps Spark RAPIDS)

Comments

@GregoryKimball
Contributor

GregoryKimball commented Apr 10, 2024

Describe the bug
We've identified some cases with columns of long strings where the parquet writer generates >16 MB pages and falls back to an uncompressed write. Due to a hard 16 MB limit in the nvcomp ZSTD compression API, if an encoded page size exceeds 16 MB, we can no longer use nvcomp's ZSTD codec. Commonly, the large page is the dictionary page in a dictionary-encoded column.

Steps/Code to reproduce bug
I will post a repro file once we have something that can be publicly shared. @pmixer we would love your help here.
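Until a shareable file is available, here is a minimal sketch of the write path that can hit the fallback, assuming `tbl` is a `cudf::table_view` holding a string column large enough that its dictionary-encoded page exceeds 16 MB:

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

// Sketch only: `tbl` is assumed to contain a string column whose dictionary page
// grows past nvcomp's 16 MB ZSTD limit.
void write_with_zstd(cudf::table_view tbl)
{
  auto opts = cudf::io::parquet_writer_options::builder(
                cudf::io::sink_info{"repro.parquet"}, tbl)
                .compression(cudf::io::compression_type::ZSTD)
                .build();
  // With the ALWAYS dictionary policy in effect, the oversized page makes the
  // writer silently fall back to an uncompressed write.
  cudf::io::write_parquet(opts);
}
```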

Expected behavior
We expect ZSTD compression to be used when it is requested.

Desired change
The options are still under discussion; we may opt for one of them, both, or something else entirely.

@GregoryKimball added the bug, libcudf, cuIO, and Spark labels on Apr 10, 2024
@GregoryKimball
Contributor Author

Hello @mhaseeb123, after some investigation with @vuule and @etseidl, we think a good option here could be changing the libcudf and cuDF-python default `dictionary_policy` to `ADAPTIVE`. Would you please create a draft PR to change the default? I would like to request evaluation by Spark in the next week or two (FYI @revans2 and @nvdbaranec).
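For reference, a minimal sketch of requesting this policy explicitly, assuming the libcudf writer options builder exposes a `dictionary_policy` setter:

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

// Sketch: with ADAPTIVE, an oversized dictionary is dropped for that column
// instead of forcing the page to be written uncompressed.
void write_adaptive(cudf::table_view tbl)
{
  auto opts = cudf::io::parquet_writer_options::builder(
                cudf::io::sink_info{"out.parquet"}, tbl)
                .compression(cudf::io::compression_type::ZSTD)
                .dictionary_policy(cudf::io::dictionary_policy::ADAPTIVE)
                .build();
  cudf::io::write_parquet(opts);
}
```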

@vuule
Contributor

vuule commented Apr 19, 2024

That was fast!

Should we adjust the behavior of ADAPTIVE to be less restrictive?
One proposal that came out of a discussion with Ed is to match the compression block limit if one exists (if not, behave the same as `ALWAYS`) and to follow the user-specified limit if one is set. Just defaulting to `ADAPTIVE` while keeping a hard-coded limit might lead to larger files, because we would give up on dictionaries even when they don't interfere with compression.
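A small sketch of the proposed limit selection (an illustration of the idea, not libcudf code; names are hypothetical): prefer the user-specified limit, otherwise the codec's block limit, otherwise no limit at all, which makes `ADAPTIVE` behave like `ALWAYS`.

```cpp
#include <cstddef>
#include <limits>
#include <optional>

// Illustration of the proposed ADAPTIVE limit selection; not libcudf code.
std::size_t effective_dictionary_limit(std::optional<std::size_t> user_limit,
                                       std::optional<std::size_t> codec_block_limit)
{
  if (user_limit) { return *user_limit; }                 // user-specified limit wins
  if (codec_block_limit) { return *codec_block_limit; }   // e.g. nvcomp ZSTD's 16 MB
  return std::numeric_limits<std::size_t>::max();         // no limit: same as ALWAYS
}
```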

rapids-bot pushed a commit that referenced this issue on May 11, 2024
…to `ADAPTIVE` (#15570)

This PR changes the default dictionary policy in parquet from `ALWAYS` to `ADAPTIVE` and adds a `max_dictionary_size` argument to control how `ADAPTIVE` the dictionary policy is. This change prevents a silent fallback to `UNCOMPRESSED` when writing parquet files with `ZSTD` compression, leading to better performance for several use cases.

Partially closes #15501.
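As a usage illustration, a sketch of combining the new default with the size knob (the `max_dictionary_size` setter name and the 1 MiB value are assumptions based on the PR description, not a verified API reference):

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

// Sketch: cap dictionary pages at a size the ZSTD codec can still compress.
void write_adaptive_capped(cudf::table_view tbl)
{
  auto opts = cudf::io::parquet_writer_options::builder(
                cudf::io::sink_info{"out.parquet"}, tbl)
                .compression(cudf::io::compression_type::ZSTD)
                .dictionary_policy(cudf::io::dictionary_policy::ADAPTIVE)
                .max_dictionary_size(1 * 1024 * 1024)  // assumed knob from #15570
                .build();
  cudf::io::write_parquet(opts);
}
```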

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Bradley Dice (https://github.com/bdice)

URL: #15570