
Copy into partition by #5964

Merged (40 commits) on Feb 1, 2023

Conversation

@samansmink (Contributor) commented Jan 23, 2023

This PR introduces a first version of the partitioned COPY operator. It builds on previous work from @lnkuiper, who added PartitionedColumnData (#4970), and on the per-thread output from @hannes (#5412).

To summarize, the Partitioned COPY:

  • Supports both CSV and Parquet
  • Currently fully materializes data during partitioning (to be improved in follow-up PRs)
  • Outputs 1 file per partition per thread (similar to the PER_THREAD_OUTPUT flag for COPY, #5412)

The partitioned write is used similarly to the PER_THREAD_OUTPUT feature:

COPY table TO '__TEST_DIR__/partitioned' (FORMAT PARQUET, PARTITION_BY (part_col_a, part_col_b));

This command will write files in the following format, which is known as the Hive partitioning scheme:

__TEST_DIR__/partitioned/part_col_a=<val>/part_col_b=<val>/data_<thread_number>.parquet
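
Not part of this PR, but for context: data written in this layout should be readable with DuckDB's existing Hive partitioning support. A rough sketch only (the glob depth assumes the two partition columns from the example above; the partition values come back as regular columns):

SELECT * FROM parquet_scan('__TEST_DIR__/partitioned/*/*/*.parquet', HIVE_PARTITIONING = 1);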

Partitioned COPY to S3 also works:

COPY table TO 's3://mah-bucket/partitioned' (FORMAT PARQUET, PARTITION_BY (part_col_a, part_col_b));

Finally, a check is performed for existing files/directories, which is currently quite conservative (and on S3 adds a bit of latency). To disable this check and force writing, an ALLOW_OVERWRITE flag has been added:

COPY table TO '__TEST_DIR__/partitioned' (FORMAT PARQUET, PARTITION_BY (part_col_a, part_col_b), ALLOW_OVERWRITE TRUE);

Note that this also works with the PER_THREAD_OUTPUT feature.
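
As an illustration only (a sketch; the target path is made up and the PER_THREAD_OUTPUT syntax follows the examples from #5412), combining the two options could look like:

COPY table TO '__TEST_DIR__/per_thread' (FORMAT PARQUET, PER_THREAD_OUTPUT TRUE, ALLOW_OVERWRITE TRUE);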

Implementation

To support these features, a new class HivePartitionedColumnData is introduced, which implements the PartitionedColumnData interface from #4970. The main complexity here is that the HivePartitionedColumnData class needs to be able to discover new partitions in parallel; for the RadixPartitioning that was already implemented, the number of partitions is known in advance and does not change. Partition discovery is handled by all threads while they write tuples to their local HivePartitionedColumnData. To avoid expensive locking when synchronizing partition discovery, each thread keeps a local partition map for quick lookups during partitioning, backed by a shared global state to which new partitions are added. A lock is therefore only required when adding new partitions to the global state. Since the partitioned write is not expected to scale to very large numbers of partitions anyway, this should work well. Due to this shared state, the partition indices of the thread-local HivePartitionedColumnData objects remain in sync.

Benchmarks

Here are some rough benchmarks to give an indication of the performance overhead, measured on an M1 MacBook:

| COPY lineitem_sf1 to file.parquet w/ 8 threads | Avg | Relative to fastest |
|---|---|---|
| Regular copy | 7.50 | 443.98% |
| Disable preserve order (#5756) | 1.40 | 1.64% |
| Threads (#5412) | 1.38 | 0.00% |
| Threads + disable order preserving | 1.42 | 2.73% |
| Hive Partitioned (4 partitions over 2 cols) | 1.62 | 17.71% |
| Hive Partitioned (28 partitions over 3 cols) | 1.92 | 38.93% |
| Hive Partitioned (84 partitions over 2 cols) | 1.89 | 36.93% |
| Hive Partitioned (432 partitions over 3 cols) | 2.58 | 86.83% |
| Hive Partitioned (1160 partitions over 5 cols) | 5.67 | 310.78% |
| Hive Partitioned (2526 partitions over 1 col) | 11.15 | 708.10% |

Note that performance for low numbers of partitions is very good. Higher partition counts get pretty bad, but this is most likely because at these partition counts the resulting files contain only very few tuples, leading to large IO overhead. I would expect the relative overhead to be lower for larger files, but more benchmarking is required here.

Transform partition columns before writing

Note that partition values are converted to strings. There are probably many edge cases where this won't work nicely. However, using plain SQL in your COPY statement, you can transform the PARTITION_BY columns however you want. For example, to add a prefix to an existing column called part_col, you could do:

COPY (SELECT * EXCLUDE (part_col), 'prefix-'::VARCHAR || part_col::VARCHAR as part_col FROM test) 
TO '__TEST_DIR__/partitioned' (FORMAT PARQUET, PARTITION_BY (part_col));

Limitations

The partitioned write fully materializes, so it will quickly fill the available memory for large tables. This is not necessarily a blocker, as the buffer manager can offload to disk. It does mean, however, that enough local disk space is required, and that for large files the data is:

  • stored in memory
  • offloaded to disk
  • read from disk
  • written to the final partition file

This is not ideal, but will still be good enough for many use cases.

Future work

The ideal partitioning COPY would produce 1 file per partition while providing mechanisms to limit memory usage and not require full materialization.

A first step towards this goal is to make a streaming variant that can flush partitions as per-partition tuple limits (or possibly global operator limits) are reached. This would produce a single file per partition, per flush of that partition.

The second step would be to not close files after flushing, allowing multiple flushes to a single file. With this we would achieve the desired behaviour where we can produce a single file per partition in a streaming fashion.

We have some nice ideas on how to implement the above, so hopefully two more PRs are coming up soon-ish :)

@samansmink (Contributor, Author) commented:

> Perfect, now just add partitions column 🤡😜

@djouallah there ya go 😁

@lnkuiper (Contributor) left a comment:

The code looks great, and the performance numbers are awesome! Performance seems to degrade quite slowly, even with ~400 partitions. In the example with 2526 partitions, I think the performance degradation is not only due to I/O, but also due to writing the data to the in-memory partitions. With this many partitions, we are basically appending 1-2 tuples to each of the partitions at a time, which brings us down to tuple-at-a-time processing 😓. Not much you can do about this, though. If you're interested, you could try sorting the data by the column that you're partitioning on; this should greatly speed up the partitioning code, since more tuples are written to a partition at a time.

Just one comment: Is it worth making a .benchmark file so that the CI checks that the performance does not degrade?
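
A minimal sketch of the sorting idea described above, reusing the hypothetical test table and part_col column from the PR description; pre-sorting clusters tuples of the same partition together, so larger batches are appended to each partition at a time:

COPY (SELECT * FROM test ORDER BY part_col)
TO '__TEST_DIR__/partitioned' (FORMAT PARQUET, PARTITION_BY (part_col));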

@tobilg commented Jan 24, 2023:

Great work, this will be a killer feature for DuckDB! Do you have plans to support writing to S3 as well in the future?

@samansmink (Contributor, Author) commented:

@lnkuiper thanks for the review, will add some benchmarks later today!

@tobilg Wow, somehow I completely forgot to test this... Thanks for the comment 😅 This feature should just work™ with S3, but it may need some additional tweaks. I'll add the tests later today. An important use case for the partitioning feature is certainly to be able to repartition a dataset from and to S3 in a fully streaming fashion.

@tobilg commented Jan 24, 2023:

> This feature should just work™ with S3, but it may need some additional tweaks. I'll add the tests later today. An important use case for the partitioning feature is certainly to be able to repartition a dataset from and to S3 in a fully streaming fashion.

Thanks, that's great news! Will there be a limitation regarding the number of partitions?

@samansmink (Contributor, Author) commented:

@tobilg There is for sure going to be prohibitive overhead when partitioning into very large numbers of partitions; see also the comment by @lnkuiper and the benchmarks in this PR. However, with the optimizations we have planned (and perhaps the sorting optimization proposed by @lnkuiper), we should be able to support decently large numbers of partitions.

@Mytherin (Collaborator) commented:

> With this many partitions, we are basically appending 1-2 tuples to each of the partitions at a time, which brings us down to tuple-at-a-time processing 😓. Not much you can do about this, though.

This is similar to the problem we encounter in hash tables. What we do about it there is divide the partitioning into two passes: first we figure out for each tuple where it should be written, then we scatter to multiple locations at once. Perhaps something similar can be done to solve this problem for partitioning?

@tobilg commented Jan 24, 2023:

> we should be able to support decently large numbers of partitions.

That'll be a great and very useful feature then! Other common ways on AWS to repartition Parquet data in S3 have pretty tight limits; e.g., Athena CTAS queries can only write to 100 partitions simultaneously. One can work around this by using multiple passes or by pre-partitioning the data, but the DX is not great.

@lnkuiper (Contributor) commented:

@Mytherin This is a good idea. It is tricky to implement, however, because we are not scattering to a row layout but appending to an intermediate buffer, which is a DataChunk. Maybe we can discuss the options tomorrow.

@samansmink (Contributor, Author) commented:

@lnkuiper @Mytherin I added two simple microbenchmarks, which I've also added to .github/regression/micro.csv. However, since these write to files, I'm not 100% sure this will work well, as it may trigger false positives in the regression checks.

@tobilg Added the S3 tests. There were also some minor issues that are now resolved. So S3 seems to work fine!

@Mytherin (Collaborator) left a comment:

Thanks for the PR! Very exciting feature. Some comments from my side:

Review comments (resolved) on: src/common/hive_partitioning.cpp, test/sql/copy/hive_filter_pushdown_bug.test, third_party/libpg_query/src_backend_parser_scan.cpp
@Mytherin (Collaborator) left a comment:

Thanks for the fixes! One minor comment, otherwise looks great:

Review comment (resolved) on: src/common/bind_helpers.cpp
@Mytherin (Collaborator) left a comment:

Thanks for the updates - LGTM! Ready to merge after CI passes.

@samansmink (Contributor, Author) commented:

@Mytherin there are 2 failures left:

  • codecov fails with an HTTP error; it can be restarted, but it has passed in previous runs with little changes, so I think it's fine
  • the ODBC failure on Linux aarch64 is also an HTTP issue and unrelated; it has also succeeded in previous runs

@Mytherin merged commit 068c0fd into duckdb:master on Feb 1, 2023
@Mytherin (Collaborator) commented Feb 1, 2023:

Thanks! Indeed everything looks good.
