Splice Parquet Data #4155
Labels
enhancement
Any new improvement worthy of an entry in the changelog
parquet
Changes to the parquet crate
Activity
tustvold added the enhancement label (Apr 28, 2023). This issue was referenced May 23, 2023.
tustvold added commits to tustvold/arrow-rs referencing this issue (May 23, 2023) and a commit referencing this issue (May 24, 2023).
alamb pushed a commit to alamb/arrow-rs referencing this issue (May 30, 2023): "Add splice column API (apache#4155) * Review feedback * Re-encode offset index"
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A common request is to be able to combine parquet files together without re-encoding data (#557) (#4150). However, correctly translating the metadata is non-trivial, and requires care to ensure the relevant file offsets are correctly updated.
Describe the solution you'd like
I would like an API on SerializedRowGroupWriter that lets me append an existing ColumnChunk from another source. For example, I originally debated making the signature

But this runs into a couple of problems.

I also debated allowing pages to be appended individually; however, in addition to the above problems, it runs into:

The downside of the ChunkReader API is that someone could potentially pass a reader that doesn't match the ColumnChunkMetaData, which would result in an inconsistent parquet file. I'm inclined to think this isn't a problem, as there are plenty of other ways to generate an invalid "parquet" file 😅

Describe alternatives you've considered
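To illustrate the core of the proposal, here is a minimal, self-contained sketch of the byte-splicing and offset-translation step such an API has to perform. The names `ChunkMeta` and `splice_chunk` are hypothetical simplifications, not the real parquet crate API: the actual ColumnChunkMetaData also carries page offsets, dictionary page offsets, encodings, and statistics, all of which need the same translation.

```rust
use std::io::{Cursor, Read, Seek, SeekFrom, Write};

// Hypothetical, drastically simplified column chunk metadata:
// just a byte range within a source file.
#[derive(Debug, Clone, PartialEq)]
struct ChunkMeta {
    file_offset: u64,
    total_len: u64,
}

// Copy a column chunk's bytes from `src` into `dst` without decoding them,
// returning metadata rewritten for the chunk's new position `dst_pos`.
// This is the offset translation the issue describes; it also shows the
// hazard noted above: nothing checks that `src` actually matches `meta`.
fn splice_chunk<R, W>(
    src: &mut R,
    meta: &ChunkMeta,
    dst: &mut W,
    dst_pos: u64,
) -> std::io::Result<ChunkMeta>
where
    R: Read + Seek,
    W: Write,
{
    src.seek(SeekFrom::Start(meta.file_offset))?;
    let mut buf = vec![0u8; meta.total_len as usize];
    src.read_exact(&mut buf)?;
    dst.write_all(&buf)?;
    Ok(ChunkMeta {
        file_offset: dst_pos,
        total_len: meta.total_len,
    })
}

fn main() -> std::io::Result<()> {
    // Source "file" containing a chunk at offset 3 with length 4.
    let mut src = Cursor::new(b"xyzDATAtail".to_vec());
    let meta = ChunkMeta { file_offset: 3, total_len: 4 };

    // Destination "file" that already holds 3 bytes of header.
    let mut dst: Vec<u8> = b"HDR".to_vec();
    let dst_pos = dst.len() as u64;
    let new_meta = splice_chunk(&mut src, &meta, &mut dst, dst_pos)?;

    assert_eq!(dst, b"HDRDATA");
    assert_eq!(new_meta.file_offset, 3); // relocated to the new position
    Ok(())
}
```

A ChunkReader-style bound (here `Read + Seek`) keeps the copy generic over files, buffers, or object-store readers, which is what makes the "reader may not match the metadata" caveat possible in the first place.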
Additional context