Expose Dictionary to reader #1270

Closed
zeevm opened this issue Feb 4, 2022 · 5 comments · Fixed by #1271
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@zeevm
Contributor

zeevm commented Feb 4, 2022

Many Parquet query engines have optimizations that rely on dictionary-encoded columns, e.g. for selections with filters.

The Rust implementation of the Parquet reader makes it difficult to work with dictionary-encoded values directly because it doesn't expose its RLE decoder. A reader that wants to work with dictionary values therefore has to re-implement an RLE decoder just to read values from dictionary-encoded data pages.

This can be easily addressed by making the RLE code public outside the crate.
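
As a rough sketch of what this would enable: assuming the crate's internal `RleDecoder` were made public with roughly its current interface (`new(bit_width)`, `set_data`, `get_batch`), and that the internal module paths and buffer type below stay as they are once exposed, a reader could decode the indices of a dictionary-encoded v1 data page along these lines:

```rust
// Sketch only: the imports assume the internal RLE and memory modules become public.
use parquet::encodings::rle::RleDecoder;
use parquet::util::memory::ByteBufferPtr;

/// Decode dictionary indices from the value section of a dictionary-encoded
/// v1 data page: one byte of bit width followed by an RLE/bit-packed hybrid
/// stream. `num_values` must come from the page header, since the stream
/// itself carries no length.
fn decode_dictionary_indices(
    page_values: &[u8],
    num_values: usize,
) -> parquet::errors::Result<Vec<i32>> {
    let bit_width = page_values[0];
    let mut decoder = RleDecoder::new(bit_width);
    // Copying keeps the sketch simple; real code would hand over the page buffer.
    decoder.set_data(ByteBufferPtr::new(page_values[1..].to_vec()));
    let mut indices = vec![0i32; num_values];
    let read = decoder.get_batch(&mut indices)?;
    indices.truncate(read);
    Ok(indices)
}
```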

@zeevm zeevm added the enhancement Any new improvement worthy of an entry in the changelog label Feb 4, 2022
@alamb
Contributor

alamb commented Feb 4, 2022

FYI I think @tustvold has some plans to contribute similar functionality directly to the parquet crate in #1191

@tustvold
Contributor

tustvold commented Feb 4, 2022

I'd be very interested in any details you can share about your particular use case, in particular whether there is any way we might be able to combine efforts in this space. The proposal in #1191 is just that, a proposal, and any input you'd be willing to provide would be most appreciated 👍

If you're using arrow, I'd also potentially draw your attention to #1180, which will preserve the dictionary encoding present in the parquet file when reading into dictionary arrays, and is slated to become the default behaviour in arrow 9.

@zeevm
Contributor Author

zeevm commented Feb 5, 2022

@alamb @tustvold My use case is a proprietary analytical DB engine; it has its own proprietary storage format but also allows running queries against external formats like Parquet.

As it already has a highly optimized scan capability for dictionary-encoded data, all I want is for it to have access to the raw Parquet dictionary.

I don't want to take a dependency on Arrow arrays for that, as I'm not using Arrow at all. I don't deserialize Parquet into Arrow, since the engine I'm working on has its own in-memory representation (I don't even build Arrow with Parquet).

@tustvold
Contributor

tustvold commented Feb 5, 2022

Thank you for taking the time to respond. I figured that might be the case, but thought it couldn't hurt to check.

I'm sure you're aware, but just as a heads-up if you're reading the data directly: the RLE encoding is not length preserving (#1111 (comment)), and a column chunk may not be consistently dictionary encoded (e.g. the writer will fall back to another encoding if the dictionary gets too large).

FWIW, some generics were added in #1041, and have evolved since, to aid decoding columns into custom in-memory representations. They're currently crate-local, but that could be changed if you wished to use them. Just let me know 😀

@zeevm
Contributor Author

zeevm commented Feb 5, 2022

@tustvold My engine assumes a column is either fully dictionary encoded or not, so for my use case I first scan the headers of all pages in a column chunk to verify they're all dictionary encoded. If any of them is not (other than the dictionary page itself, of course), I treat the column as not dictionary encoded, meaning I'll read it with a ColumnReader instead of a PageReader and let the library handle the variously encoded pages.
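
Roughly, that check looks like the sketch below. It's written against the crate's current `PageReader` / `Page` API as I understand it (so treat the exact paths and signatures as approximate), and it pulls each page through `get_next_page` rather than inspecting only the page headers:

```rust
use parquet::basic::Encoding;
use parquet::column::page::{Page, PageReader};
use parquet::file::reader::RowGroupReader;

/// Returns true only if every data page in the column chunk is dictionary
/// encoded; the dictionary page itself is skipped. Sketch only.
fn is_fully_dictionary_encoded(
    row_group: &dyn RowGroupReader,
    column_index: usize,
) -> parquet::errors::Result<bool> {
    let mut pages = row_group.get_column_page_reader(column_index)?;
    while let Some(page) = pages.get_next_page()? {
        match &page {
            // The dictionary page is expected; ignore it.
            Page::DictionaryPage { .. } => continue,
            // Any data page with a non-dictionary encoding means the writer
            // fell back, so this column gets read through a ColumnReader.
            _ => match page.encoding() {
                Encoding::PLAIN_DICTIONARY | Encoding::RLE_DICTIONARY => continue,
                _ => return Ok(false),
            },
        }
    }
    Ok(true)
}
```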
