
[BUG] Prevent silent fallback to uncompressed when writing parquet files with ZSTD compression #15501

Closed
GregoryKimball opened this issue Apr 10, 2024 · 2 comments · Fixed by #15570
Labels
bug (Something isn't working) · cuIO (cuIO issue) · libcudf (Affects libcudf (C++/CUDA) code) · Spark (Functionality that helps Spark RAPIDS)

Comments

@GregoryKimball
Contributor

GregoryKimball commented Apr 10, 2024

Describe the bug
We've identified some cases with columns of long strings where the parquet writer generates >16 MB pages and falls back to an uncompressed write. Due to a hard 16 MB limit in the nvcomp ZSTD compression API, if an encoded page size exceeds 16 MB, we can no longer use nvcomp's ZSTD codec. Commonly, the large page is the dictionary page in a dictionary-encoded column.

Steps/Code to reproduce bug
I will post a repro file once we have something that can be publicly shared. @pmixer we would love your help here.
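Until a shareable file is available, here is a minimal sketch of the write path that can hit the fallback, assuming `tbl` is a `cudf::table_view` holding a string column large enough that its dictionary-encoded page exceeds 16 MB:

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

// Sketch only: `tbl` is assumed to contain a string column whose dictionary page
// grows past nvcomp's 16 MB ZSTD limit.
void write_with_zstd(cudf::table_view tbl)
{
  auto opts = cudf::io::parquet_writer_options::builder(
                cudf::io::sink_info{"repro.parquet"}, tbl)
                .compression(cudf::io::compression_type::ZSTD)
                .build();
  // With the ALWAYS dictionary policy in effect, the oversized page makes the
  // writer silently fall back to an uncompressed write.
  cudf::io::write_parquet(opts);
}
```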

Expected behavior
We expect ZSTD compression to be used when it is requested.

Desired change
The options are still under discussion; we may opt for one of them, both, or something else entirely.

@GregoryKimball added the bug, libcudf, cuIO, and Spark labels on Apr 10, 2024
@GregoryKimball
Contributor Author

Hello @mhaseeb123, after some investigation with @vuule and @etseidl, we think a good option here could be changing the libcudf and cuDF-python default `dictionary_policy` to `ADAPTIVE`. Would you please create a draft PR to change the default? I would like to request evaluation by Spark in the next week or two (FYI @revans2 and @nvdbaranec).
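For reference, a minimal sketch of requesting this policy explicitly, assuming the libcudf writer options builder exposes a `dictionary_policy` setter:

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

// Sketch: with ADAPTIVE, an oversized dictionary is dropped for that column
// instead of forcing the page to be written uncompressed.
void write_adaptive(cudf::table_view tbl)
{
  auto opts = cudf::io::parquet_writer_options::builder(
                cudf::io::sink_info{"out.parquet"}, tbl)
                .compression(cudf::io::compression_type::ZSTD)
                .dictionary_policy(cudf::io::dictionary_policy::ADAPTIVE)
                .build();
  cudf::io::write_parquet(opts);
}
```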

@vuule
Contributor

vuule commented Apr 19, 2024

That was fast!

Should we adjust the behavior of ADAPTIVE to be less restrictive?
One proposal that came out of a discussion with Ed is to match the compression block limit if one exists (if not, behave the same as `ALWAYS`) and to follow the user-specified limit if one is set. Just defaulting to `ADAPTIVE` while keeping a hard-coded limit might lead to larger files, because we would give up on dictionaries even when they don't interfere with compression.
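A small sketch of the proposed limit selection (an illustration of the idea, not libcudf code; names are hypothetical): prefer the user-specified limit, otherwise the codec's block limit, otherwise no limit at all, which makes `ADAPTIVE` behave like `ALWAYS`.

```cpp
#include <cstddef>
#include <limits>
#include <optional>

// Illustration of the proposed ADAPTIVE limit selection; not libcudf code.
std::size_t effective_dictionary_limit(std::optional<std::size_t> user_limit,
                                       std::optional<std::size_t> codec_block_limit)
{
  if (user_limit) { return *user_limit; }                 // user-specified limit wins
  if (codec_block_limit) { return *codec_block_limit; }   // e.g. nvcomp ZSTD's 16 MB
  return std::numeric_limits<std::size_t>::max();         // no limit: same as ALWAYS
}
```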

rapids-bot pushed a commit that referenced this issue on May 11, 2024
…to `ADAPTIVE` (#15570)

This PR changes the default dictionary policy in parquet from `ALWAYS` to `ADAPTIVE` and adds a `max_dictionary_size` argument to control how `ADAPTIVE` the dictionary policy is. This change prevents a silent fallback to `UNCOMPRESSED` when writing parquet files with `ZSTD` compression, leading to better performance for several use cases.

Partially closes #15501.
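As a usage illustration, a sketch of combining the new default with the size knob (the `max_dictionary_size` setter name and the 1 MiB value are assumptions based on the PR description, not a verified API reference):

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

// Sketch: cap dictionary pages at a size the ZSTD codec can still compress.
void write_adaptive_capped(cudf::table_view tbl)
{
  auto opts = cudf::io::parquet_writer_options::builder(
                cudf::io::sink_info{"out.parquet"}, tbl)
                .compression(cudf::io::compression_type::ZSTD)
                .dictionary_policy(cudf::io::dictionary_policy::ADAPTIVE)
                .max_dictionary_size(1 * 1024 * 1024)  // assumed knob from #15570
                .build();
  cudf::io::write_parquet(opts);
}
```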

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Bradley Dice (https://github.com/bdice)

URL: #15570