
Splice Parquet Data #4155

Closed
tustvold opened this issue Apr 28, 2023 · 1 comment · Fixed by #4269
Assignees
tustvold

Labels
enhancement (Any new improvement worthy of an entry in the changelog)
parquet (Changes to the parquet crate)

Comments

@tustvold (Contributor)

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

A common request is the ability to combine parquet files without re-encoding their data (#557, #4150). However, correctly translating the metadata is non-trivial and requires care to ensure the relevant file offsets are updated correctly.
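
To make the offset bookkeeping concrete, here is a rough sketch of the relocation that copying a chunk implies (the accessors are existing parquet crate methods on ColumnChunkMetaData; the helper itself is hypothetical):

use parquet::file::metadata::ColumnChunkMetaData;

/// Hypothetical helper: the byte range a column chunk occupies in its source
/// file, and the delta to apply to its absolute offsets once it is copied to
/// `dst_offset` in the new file
fn relocation(src: &ColumnChunkMetaData, dst_offset: i64) -> (i64, i64, i64) {
    // A chunk starts at its dictionary page if present, else at its first data page
    let start = src
        .dictionary_page_offset()
        .unwrap_or_else(|| src.data_page_offset());
    let end = start + src.compressed_size();
    (start, end, dst_offset - start)
}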

Describe the solution you'd like

I would like an API on SerializedRowGroupWriter that lets me append an existing ColumnChunk from another source. For example,

/// Splice a column from another file without decoding it
///
/// This can be used for efficiently concatenating or projecting parquet data
pub fn splice_column<R: ChunkReader>(&mut self, reader: &R, metadata: &ColumnChunkMetaData) -> Result<()> {
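
For illustration, a rough sketch of how this could be used to concatenate several files without re-encoding any pages. The splice_column call is the proposed API above; the surrounding plumbing uses existing parquet crate types, though the exact wiring here is only an assumption:

use std::fs::File;
use std::sync::Arc;

use parquet::errors::Result;
use parquet::file::footer::parse_metadata;
use parquet::file::properties::WriterProperties;
use parquet::file::writer::SerializedFileWriter;

/// Concatenate `inputs` into `output` by copying column chunks verbatim,
/// assuming every input shares the schema of the first file
fn concat(inputs: &[File], output: File) -> Result<()> {
    let schema = parse_metadata(&inputs[0])?
        .file_metadata()
        .schema_descr()
        .root_schema_ptr();

    let props = Arc::new(WriterProperties::builder().build());
    let mut writer = SerializedFileWriter::new(output, schema, props)?;

    for input in inputs {
        let metadata = parse_metadata(input)?;
        for rg in metadata.row_groups() {
            let mut row_group = writer.next_row_group()?;
            for column in rg.columns() {
                // Proposed API: copy the compressed bytes of the chunk
                // unchanged, rewriting only the offsets in the metadata
                row_group.splice_column(input, column)?;
            }
            row_group.close()?;
        }
    }
    writer.close()?;
    Ok(())
}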

I originally debated making the signature

pub fn splice_column(&mut self, column: &dyn PageReader) -> Result<()> {

But this runs into a couple of problems:

  • The PageReader returns uncompressed, decoded pages (although the value data is still encoded)
  • It isn't clear how to preserve the page index or any bloom filter information

I also debated allowing pages to be appended individually; however, in addition to the above problems, this runs into:

  • A column chunk can only have a single dictionary page
  • The compression codec is specified at the column chunk level

The downside of the ChunkReader API is that someone could potentially pass a reader that doesn't match the ColumnChunkMetaData, which would result in an inconsistent parquet file. I'm inclined to think this isn't a problem, as there are plenty of other ways to generate an invalid "parquet" file 😅
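
That said, a cheap sanity check is still possible, e.g. rejecting readers that are too short for the chunk's claimed byte range (byte_range and the Length supertrait of ChunkReader exist in the parquet crate; this helper is only a sketch):

use parquet::errors::{ParquetError, Result};
use parquet::file::metadata::ColumnChunkMetaData;
use parquet::file::reader::{ChunkReader, Length};

/// Sketch: verify the reader at least contains the byte range the metadata
/// claims the column chunk occupies
fn check_covers<R: ChunkReader>(reader: &R, meta: &ColumnChunkMetaData) -> Result<()> {
    let (start, len) = meta.byte_range();
    if start + len > reader.len() {
        return Err(ParquetError::General(format!(
            "reader too short for column chunk: needs {} bytes, have {}",
            start + len,
            reader.len()
        )));
    }
    Ok(())
}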

Describe alternatives you've considered

Additional context

@tustvold added the enhancement label Apr 28, 2023
@tustvold self-assigned this Apr 28, 2023
tustvold added commits to tustvold/arrow-rs that referenced this issue May 23, 2023
tustvold added a commit that referenced this issue May 24, 2023
* Add splice column API (#4155)

* Review feedback

* Re-encode offset index
alamb pushed a commit to alamb/arrow-rs that referenced this issue May 30, 2023
@tustvold added the parquet label Jun 2, 2023
@tustvold (Contributor, Author) commented Jun 2, 2023

label_issue.py automatically added labels {'parquet'} from #4265
