
WIP: changes to upstream DF, in order to enable parallelized writes with ParquetSink #11

Closed · wants to merge 4 commits

Conversation

@wiedld (Collaborator) commented Apr 24, 2024:

This branches off the most recent DataFusion version in IOx, then adds additional patches to make the parallelized ParquetSink work with our metadata use case.

Background.

IOx adds its own metadata to the parquet file. Currently we do so by supplying WriterProperties to the ArrowWriter. When we did the ParquetSink parallelized-write PoC, we also provided this IOx metadata by adding it to the WriterProperties given to ParquetSink::write_all().

The approach used in the PoC is no longer viable. There was a change to unify the different writer options across sink types, specifically to make COPY TO and create external table have a uniform configuration. Users can now specify the configuration with the query (e.g. COPY <src> TO <sink> (<config_options>)). This was a good high level change; however, we would like to iterate on this approach.

The current implementation derives the writer properties from the TableParquetOptions. This conversion always sets the sorting_columns and the user-defined kv_metadata to None, as demonstrated in the first commit. We have several choices for how to restore the ability to set these options; those choices are commented below in this WIP.
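The lossy step can be sketched with simplified stand-in types. The names below mirror DataFusion's TableParquetOptions and parquet's WriterProperties, but none of this is the real implementation; it only illustrates where the information is dropped:

```rust
// Simplified stand-ins; the real types live in DataFusion and the parquet
// crate. Fields not relevant to the sketch are elided.
struct SortingColumn; // per-column sort order, elided
struct KeyValue {
    key: String,
    value: Option<String>,
}

struct TableParquetOptions; // global/column options elided

struct WriterProperties {
    sorting_columns: Option<Vec<SortingColumn>>,
    key_value_metadata: Option<Vec<KeyValue>>,
}

impl TryFrom<&TableParquetOptions> for WriterProperties {
    type Error = String;

    fn try_from(_opts: &TableParquetOptions) -> Result<Self, Self::Error> {
        // The conversion builds writer properties from the parsed options,
        // but hard-codes these two fields to None, so any user-provided
        // metadata or sort information is lost on the way to the writer.
        Ok(WriterProperties {
            sorting_columns: None,
            key_value_metadata: None,
        })
    }
}

fn main() {
    let props = WriterProperties::try_from(&TableParquetOptions).unwrap();
    assert!(props.sorting_columns.is_none());
    assert!(props.key_value_metadata.is_none());
    println!("sorting_columns and kv_metadata are always None");
}
```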

@github-actions github-actions bot added the core label Apr 24, 2024
type Error = DataFusionError;

fn try_from(parquet_options: &TableParquetOptions) -> Result<Self> {
let parquet_session_options = &parquet_options.global;
let ParquetOptions {
@wiedld (Collaborator, Author) commented Apr 24, 2024:

The ParquetOptions are the configuration which can be provided within a SQL query, and are therefore intended to be in an easily parsable format (refer to the ConfigField trait and associated macros in the linked file).

The sorting_columns may lend itself to this use case, since it is easy to parse and could reasonably be provided within a SQL query. However, the same is not true for the user-provided kv_metadata.

@wiedld (Collaborator, Author) commented Apr 24, 2024:

I have a question regarding the use case for the WriterProperties sorting_columns. It's listed in the parquet interface; is this referring to a per-row-group applied sorting that only occurs on write? Is there a use case for datafusion, given that we already sort earlier in the batch stream?

A collaborator replied, quoting the question above:

> I have a question regarding the use case for the WriterProperties sorting_columns. It's listed in the parquet interface;

In theory it is supposed to be used to let readers infer information from the file. I don't know how widely it is written or used by other parquet readers/writers.

IOx stores its sort information in its own metadata, so I think setting the fields in the parquet metadata could be a separate project

@alamb (Collaborator) left a comment:

The basic idea looks good to me here. 👍 @wiedld

One thing that might be worth considering is how to test this

For example, do we want to add a SQL level API like

COPY (values (1), (2)) TO 'foo.parquet' OPTION (metadata "foo:bar")

(probably not that exact syntax)

Or maybe we just expose the APIs for use programmatically:

/// Optional, additional metadata to be inserted into the key_value_metadata
/// for the written [`FileMetaData`](https://docs.rs/parquet/latest/parquet/file/metadata/struct.FileMetaData.html).
#[cfg(feature = "parquet")]
pub key_value_metadata: Option<Vec<KeyValue>>,
A collaborator commented:

I think Key/Value is just owned strings. https://docs.rs/parquet/latest/parquet/format/struct.KeyValue.html

So we could avoid the cfgs by doing something like:

Suggested change:

- pub key_value_metadata: Option<Vec<KeyValue>>,
+ pub key_value_metadata: HashMap<String, Option<String>>,

And then translating that to KeyValues during the write.

This would also make the protobuf easier to handle (just follow the encoding for HashMaps)
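A minimal sketch of that translation, using a stand-in KeyValue struct in place of parquet's format::KeyValue (which is likewise just owned strings) so the example compiles without the parquet dependency; the function name and sorting step are illustrative assumptions, not DataFusion's actual code:

```rust
use std::collections::HashMap;

// Stand-in for parquet's `format::KeyValue` so the sketch compiles without
// the parquet crate; the real type is also just owned strings.
#[derive(Debug, Clone, PartialEq)]
struct KeyValue {
    key: String,
    value: Option<String>,
}

/// Translate the cfg-free HashMap representation into the
/// `Option<Vec<KeyValue>>` shape the writer properties expect. Doing this
/// once at write time keeps the options struct free of the parquet types.
fn to_key_value_metadata(meta: &HashMap<String, Option<String>>) -> Option<Vec<KeyValue>> {
    if meta.is_empty() {
        return None;
    }
    let mut kvs: Vec<KeyValue> = meta
        .iter()
        .map(|(k, v)| KeyValue { key: k.clone(), value: v.clone() })
        .collect();
    // HashMap iteration order is unspecified; sort for deterministic output.
    kvs.sort_by(|a, b| a.key.cmp(&b.key));
    Some(kvs)
}

fn main() {
    let mut meta = HashMap::new();
    meta.insert("iox::metadata".to_string(), Some("payload".to_string()));
    let kvs = to_key_value_metadata(&meta).unwrap();
    println!("{} key/value pair(s) written", kvs.len());
}
```

As noted above, a plain `HashMap<String, Option<String>>` also maps directly onto protobuf's map encoding, which the Option<Vec<KeyValue>> shape does not.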


reorder_filters: _,
allow_single_file_parallelism: _,
maximum_parallel_row_group_writers: _,
maximum_buffered_record_batches_per_stream: _,
A collaborator commented:

This is a nice change

@alamb (Collaborator) commented Apr 24, 2024:

👌

@wiedld (Collaborator, Author) commented Apr 24, 2024:

One thing that might be worth considering is how to test this

For example, do we want to add a SQL level API like

Added sqllogictests, which did require a change for proper config after statement parse. Syntax is almost exactly as requested by @alamb .

@appletreeisyellow commented:

This patch will be included in this DataFusion update: https://github.com/influxdata/influxdb_iox/pull/10780. It is in the queue behind two PRs: https://github.com/influxdata/influxdb_iox/pull/10764 and https://github.com/influxdata/influxdb_iox/pull/10772. If those two PRs are deployed without issue, I plan to merge https://github.com/influxdata/influxdb_iox/pull/10780 tomorrow morning.

@appletreeisyellow commented:

#13 brought in the upstream change (apache#10224 / apache@9c8873a), so closing this one

@appletreeisyellow appletreeisyellow deleted the patch-for-10392 branch April 29, 2024 16:44
wiedld pushed a commit that referenced this pull request Jul 17, 2024
… `interval` (apache#11466)

* Unparser rule for datetime cast (#10)

* use timestamp as the identifier for date64

* rename

* implement CustomDialectBuilder

* fix

* dialect with interval style (#11)

---------

Co-authored-by: Phillip LeBlanc <[email protected]>

* fmt

* clippy

* doc

* Update datafusion/sql/src/unparser/expr.rs

Co-authored-by: Andrew Lamb <[email protected]>

* update the doc for CustomDialectBuilder

* fix doc test

---------

Co-authored-by: Phillip LeBlanc <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
wiedld pushed a commit that referenced this pull request Jul 31, 2024
… `interval` (apache#11466)
