
Splice Parquet Data #4155

Closed
tustvold opened this issue Apr 28, 2023 · 1 comment · Fixed by #4269
Assignees
tustvold

Labels
enhancement (Any new improvement worthy of an entry in the changelog)
parquet (Changes to the parquet crate)

Comments

@tustvold (Contributor)

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

A common request is the ability to combine parquet files without re-encoding their data (#557, #4150). However, correctly translating the metadata is non-trivial and requires care to ensure the relevant file offsets are updated correctly.
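
To make the offset bookkeeping concrete, here is a rough sketch of the relocation that copying a chunk implies (the accessors are existing parquet crate methods on ColumnChunkMetaData; the helper itself is hypothetical):

use parquet::file::metadata::ColumnChunkMetaData;

/// Hypothetical helper: the byte range a column chunk occupies in its source
/// file, and the delta to apply to its absolute offsets once it is copied to
/// `dst_offset` in the new file
fn relocation(src: &ColumnChunkMetaData, dst_offset: i64) -> (i64, i64, i64) {
    // A chunk starts at its dictionary page if present, else at its first data page
    let start = src
        .dictionary_page_offset()
        .unwrap_or_else(|| src.data_page_offset());
    let end = start + src.compressed_size();
    (start, end, dst_offset - start)
}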

Describe the solution you'd like

I would like an API on SerializedRowGroupWriter that lets me append an existing ColumnChunk from another source. For example,

/// Splice a column from another file without decoding it
///
/// This can be used for efficiently concatenating or projecting parquet data
pub fn splice_column<R: ChunkReader>(&mut self, reader: &R, metadata: &ColumnChunkMetaData) -> Result<()> {
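
For illustration, a rough sketch of how this could be used to concatenate several files without re-encoding any pages. The splice_column call is the proposed API above; the surrounding plumbing uses existing parquet crate types, though the exact wiring here is only an assumption:

use std::fs::File;
use std::sync::Arc;

use parquet::errors::Result;
use parquet::file::footer::parse_metadata;
use parquet::file::properties::WriterProperties;
use parquet::file::writer::SerializedFileWriter;

/// Concatenate `inputs` into `output` by copying column chunks verbatim,
/// assuming every input shares the schema of the first file
fn concat(inputs: &[File], output: File) -> Result<()> {
    let schema = parse_metadata(&inputs[0])?
        .file_metadata()
        .schema_descr()
        .root_schema_ptr();

    let props = Arc::new(WriterProperties::builder().build());
    let mut writer = SerializedFileWriter::new(output, schema, props)?;

    for input in inputs {
        let metadata = parse_metadata(input)?;
        for rg in metadata.row_groups() {
            let mut row_group = writer.next_row_group()?;
            for column in rg.columns() {
                // Proposed API: copy the compressed bytes of the chunk
                // unchanged, rewriting only the offsets in the metadata
                row_group.splice_column(input, column)?;
            }
            row_group.close()?;
        }
    }
    writer.close()?;
    Ok(())
}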

I originally debated making the signature

pub fn splice_column(&mut self, column: &dyn PageReader) -> Result<()> {

But this runs into a couple of problems:

  • The PageReader returns uncompressed, decoded pages (although the value data is still encoded)
  • It isn't clear how to preserve the page index or any bloom filter information

I also debated allowing pages to be appended individually; however, in addition to the above problems, this runs into:

  • A column chunk can only have a single dictionary page
  • The compression codec is specified at the column chunk level

The downside of the ChunkReader API is that someone could potentially pass a reader that doesn't match the ColumnChunkMetaData, which would result in an inconsistent parquet file. I'm inclined to think this isn't a problem, as there are plenty of other ways to generate an invalid "parquet" file 😅
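
That said, a cheap sanity check is still possible, e.g. rejecting readers that are too short for the chunk's claimed byte range (byte_range and the Length supertrait of ChunkReader exist in the parquet crate; this helper is only a sketch):

use parquet::errors::{ParquetError, Result};
use parquet::file::metadata::ColumnChunkMetaData;
use parquet::file::reader::{ChunkReader, Length};

/// Sketch: verify the reader at least contains the byte range the metadata
/// claims the column chunk occupies
fn check_covers<R: ChunkReader>(reader: &R, meta: &ColumnChunkMetaData) -> Result<()> {
    let (start, len) = meta.byte_range();
    if start + len > reader.len() {
        return Err(ParquetError::General(format!(
            "reader too short for column chunk: needs {} bytes, have {}",
            start + len,
            reader.len()
        )));
    }
    Ok(())
}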

Describe alternatives you've considered

Additional context

@tustvold added the enhancement label Apr 28, 2023
@tustvold self-assigned this Apr 28, 2023
tustvold added commits to tustvold/arrow-rs that referenced this issue May 23, 2023
tustvold added a commit that referenced this issue May 24, 2023
* Add splice column API (#4155)

* Review feedback

* Re-encode offset index
alamb pushed a commit to alamb/arrow-rs that referenced this issue May 30, 2023
@tustvold added the parquet label Jun 2, 2023
@tustvold (Contributor, Author) commented Jun 2, 2023

label_issue.py automatically added labels {'parquet'} from #4265
