Describe the bug
We've identified cases with long string columns where the Parquet writer generates pages larger than 16 MB and falls back to an uncompressed write. Due to a hard 16 MB limit in the nvcomp ZSTD compression API, if an encoded page size exceeds 16 MB we can no longer use nvcomp's ZSTD codec. Commonly, the large page is the dictionary page of a dictionary-encoded column.
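The fallback described above can be sketched as a simple size check. This is a minimal, hypothetical illustration in plain Python; the constant and function names are not cudf or nvcomp identifiers.

```python
# Hypothetical sketch of the fallback: a page whose encoded size exceeds
# nvcomp's ZSTD chunk limit cannot be compressed and is written uncompressed.
# The names below are illustrative only, not actual cudf/nvcomp symbols.

NVCOMP_ZSTD_MAX_CHUNK_BYTES = 16 * 1024 * 1024  # hard limit in nvcomp's ZSTD API

def choose_page_codec(encoded_page_size: int, requested_codec: str = "ZSTD") -> str:
    """Return the codec actually used for a page of the given encoded size."""
    if requested_codec == "ZSTD" and encoded_page_size > NVCOMP_ZSTD_MAX_CHUNK_BYTES:
        # nvcomp cannot compress this chunk, so the writer falls back
        return "UNCOMPRESSED"
    return requested_codec

# A large dictionary page for a long string column trips the fallback:
print(choose_page_codec(20 * 1024 * 1024))  # UNCOMPRESSED
print(choose_page_codec(1 * 1024 * 1024))   # ZSTD
```

Since the dictionary page for a column is a single chunk, one oversized dictionary is enough to lose ZSTD for that page even when every data page is small.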
Steps/Code to reproduce bug
I will post a repro file once we have something that can be publicly shared. @pmixer we would love your help here.
Expected behavior
We expected ZSTD compression to be used when it is requested.
Desired change
These options are still under discussion. We may opt for one, or both, or something else.
Drop the 16 MB limit in the nvcomp ZSTD compression API, possibly emitting a warning when the requested page is large.
Hello @mhaseeb123, after some investigation with @vuule and @etseidl, we think that a good option here could be changing the libcudf and cuDF-python default dictionary_policy to ADAPTIVE. Would you please create a draft PR to change the default? I would like to request evaluation by Spark in the next week or two (FYI @revans2 and @nvdbaranec)
Should we adjust the behavior of ADAPTIVE to be less restrictive?
One proposal that came out of discussion with Ed is to match the compression block limit if one exists (otherwise the behavior is the same as ALWAYS), and to follow the user-specified limit if it is set. Simply defaulting to ADAPTIVE while keeping a hard-coded limit might lead to larger files, since we would give up on dictionaries even when they don't interfere with compression.
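The proposed limit selection can be sketched as follows. All names here are hypothetical, written to illustrate the precedence under discussion (user limit, then codec block limit, then no limit), not the actual libcudf implementation.

```python
# Sketch of the proposed ADAPTIVE behavior: keep the dictionary unless its
# size exceeds a limit chosen by precedence. Names are illustrative only.

ZSTD_BLOCK_LIMIT_BYTES = 16 * 1024 * 1024  # nvcomp ZSTD chunk limit (for illustration)

def dictionary_size_limit(user_limit=None, codec_block_limit=None) -> float:
    if user_limit is not None:
        return user_limit          # a user-specified limit always wins
    if codec_block_limit is not None:
        return codec_block_limit   # otherwise match the compression block limit
    return float("inf")            # no limit at all: behaves like ALWAYS

def use_dictionary(dict_page_size, user_limit=None, codec_block_limit=None) -> bool:
    """Decide whether a dictionary of the given size survives under ADAPTIVE."""
    return dict_page_size <= dictionary_size_limit(user_limit, codec_block_limit)

# A 20 MB dictionary is dropped when ZSTD's block limit applies...
print(use_dictionary(20 * 1024 * 1024, codec_block_limit=ZSTD_BLOCK_LIMIT_BYTES))  # False
# ...but kept when no codec limit exists, matching ALWAYS:
print(use_dictionary(20 * 1024 * 1024))  # True
```

Under this scheme a hard-coded cutoff only applies when it actually matters for compression, which addresses the concern about needlessly abandoning dictionaries.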