Avoid Buffering Arrow Data for Entire Row Group in parquet::ArrowWriter #3871

Closed
tustvold opened this issue Mar 15, 2023 · 3 comments · Fixed by #4280
Labels: arrow (changes to the arrow crate), enhancement (any new improvement worthy of an entry in the changelog), parquet (changes to the parquet crate)

Comments

@tustvold
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently ArrowWriter buffers each RecordBatch until it has enough rows to populate an entire row group, and only then writes each column in turn to the output buffer. All of the arrow data for a row group is therefore held in memory at once.

Describe the solution you'd like

The encoded parquet data is often orders of magnitude smaller than the corresponding arrow data. The read path goes to great lengths to allow incremental reading of data within a row group. It may therefore be desirable to instead encode arrow data eagerly, writing each ColumnChunk to its own temporary buffer, and then stitching these back together.

This would allow writing larger row groups, whilst potentially consuming less memory in the arrow writer.

This would likely involve extending, or possibly replacing, SerializedRowGroupWriter to allow writing to the same column multiple times.
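The idea can be sketched with a toy model that uses only the standard library. This is not the parquet crate's API: `EagerRowGroupWriter`, `encode_rle`, `write_batch`, `buffered_bytes` and `close` are hypothetical names, and the run-length encoder is just a stand-in for real Parquet encoding. It only illustrates encoding each batch eagerly into one buffer per column and stitching the buffers together when the row group is closed.

```rust
/// Toy encoder (assumption, not Parquet's encoding): run-length encodes
/// `values` as little-endian (run_len: u32, value: i32) pairs.
fn encode_rle(values: &[i32], out: &mut Vec<u8>) {
    let mut i = 0;
    while i < values.len() {
        let value = values[i];
        let mut run = 1u32;
        while i + (run as usize) < values.len() && values[i + run as usize] == value {
            run += 1;
        }
        out.extend_from_slice(&run.to_le_bytes());
        out.extend_from_slice(&value.to_le_bytes());
        i += run as usize;
    }
}

/// Hypothetical writer that keeps one encoded buffer per column
/// instead of holding on to the unencoded input batches.
struct EagerRowGroupWriter {
    columns: Vec<Vec<u8>>,
}

impl EagerRowGroupWriter {
    fn new(num_columns: usize) -> Self {
        Self { columns: vec![Vec::new(); num_columns] }
    }

    /// Encode a batch immediately; the caller may drop `batch` afterwards.
    /// The same column can be written to any number of times.
    fn write_batch(&mut self, batch: &[Vec<i32>]) {
        assert_eq!(batch.len(), self.columns.len());
        for (buffer, values) in self.columns.iter_mut().zip(batch) {
            encode_rle(values, buffer);
        }
    }

    /// Total encoded bytes currently held across all column buffers.
    fn buffered_bytes(&self) -> usize {
        self.columns.iter().map(|c| c.len()).sum()
    }

    /// Stitch the column buffers into `out` in column order and return
    /// the (offset, length) of each column chunk within `out`.
    fn close(self, out: &mut Vec<u8>) -> Vec<(usize, usize)> {
        let mut layout = Vec::with_capacity(self.columns.len());
        for buffer in self.columns {
            layout.push((out.len(), buffer.len()));
            out.extend_from_slice(&buffer);
        }
        layout
    }
}
```

In this model the memory held between batches is proportional to the encoded size rather than the size of the input, which is the property the proposal is after.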

Describe alternatives you've considered

We could leave this as is. Parquet is inherently a read-optimised format, so write performance may be less of a priority for many workloads.

Additional context

tustvold added the enhancement label Mar 15, 2023
@alamb
Contributor

alamb commented May 22, 2023

This ticket will improve https://github.com/influxdata/influxdb_iox/issues/7783 -- thank you for filing it.

As part of this feature, I would like to request a user-definable, best-effort limit on how much memory the parquet writer will buffer (so that flushing is a function of both "max_row_group_size" and "buffer_limit").

If for some reason that is not possible or advisable, exposing the currently buffered size would be OK too (so external users can implement the buffer limiting themselves).
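A rough sketch of what such a flush rule could look like, using only the standard library. `FlushPolicy`, `should_flush` and `simulate` are hypothetical names for illustration, and `buffer_limit` here is the option requested above, not an existing setting of the parquet crate. The limit is best effort because it is only checked after a batch has been buffered, so it can be overshot by up to one batch.

```rust
/// Hypothetical flush policy (assumption): flush when either the row
/// count or the buffered byte count reaches its threshold.
#[derive(Debug, Clone, Copy)]
struct FlushPolicy {
    max_row_group_size: usize,
    /// Best-effort cap on buffered bytes; `None` disables the check.
    buffer_limit: Option<usize>,
}

impl FlushPolicy {
    fn should_flush(&self, buffered_rows: usize, buffered_bytes: usize) -> bool {
        let rows_full = buffered_rows >= self.max_row_group_size;
        let over_budget = self
            .buffer_limit
            .map_or(false, |limit| buffered_bytes >= limit);
        rows_full || over_budget
    }
}

/// Feed batches described as (rows, buffered_bytes) through the policy
/// and return the number of rows in each row group that gets flushed.
fn simulate(policy: FlushPolicy, batches: &[(usize, usize)]) -> Vec<usize> {
    let mut groups = Vec::new();
    let (mut rows, mut bytes) = (0usize, 0usize);
    for &(batch_rows, batch_bytes) in batches {
        rows += batch_rows;
        bytes += batch_bytes;
        if policy.should_flush(rows, bytes) {
            groups.push(rows);
            rows = 0;
            bytes = 0;
        }
    }
    // Whatever is still buffered is flushed when the writer is closed.
    if rows > 0 {
        groups.push(rows);
    }
    groups
}
```

If only the buffered size were exposed, as in the fallback suggested above, the same `should_flush` check could live in the caller instead of the writer.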

@tustvold tustvold self-assigned this May 22, 2023
@tustvold
Contributor Author

I think #4155 is a precursor to this, as it provides the necessary APIs to encode the columns separately and then stitch them together again. I therefore intend to work on it first.

@alamb
Contributor

alamb commented May 23, 2023

I wonder if you also might think about #1718 "encode the columns in parallel while writing parquet" while working on this.
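Once each column is encoded into its own buffer, the columns no longer depend on each other, so they could in principle be encoded on separate threads. A minimal sketch with `std::thread::scope`, where `encode_column` and `encode_columns_parallel` are hypothetical names and the encoder is a placeholder that simply writes little-endian bytes:

```rust
use std::thread;

/// Placeholder encoder (assumption): writes each value as 8 little-endian bytes.
fn encode_column(values: &[i64]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Encode every column on its own scoped thread and return the encoded
/// buffers in the original column order.
fn encode_columns_parallel(columns: &[Vec<i64>]) -> Vec<Vec<u8>> {
    thread::scope(|s| {
        // Spawn all threads first so the columns are encoded concurrently.
        let handles: Vec<_> = columns
            .iter()
            .map(|col| s.spawn(move || encode_column(col)))
            .collect();
        // Joining in spawn order keeps the output aligned with the input.
        handles
            .into_iter()
            .map(|h| h.join().expect("encoder thread panicked"))
            .collect()
    })
}
```

One thread per column is only for illustration; a real writer would more likely hand the columns to a bounded pool of workers.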

tustvold added a commit to tustvold/arrow-rs that referenced this issue May 25, 2023
tustvold added a commit that referenced this issue May 29, 2023
…ad of RecordBatch (#3871) (#4280)

* Buffer Pages in ArrowWriter instead of RecordBatch (#3871)

* Review feedback

* Improved memory accounting

* Clippy
tustvold added the parquet and arrow labels Jun 2, 2023