Copy into partition by #5964
Conversation
Bug in parquet filter pushdown where the filters would be malformed
@djouallah there ya go 😁
The code looks great, and the performance numbers are awesome! Performance seems to degrade quite slowly, even with ~400 partitions. In the example with 2526 partitions, I think the performance degradation is not only due to I/O, but also due to writing the data to the in-memory partitions. With this many partitions, we are basically appending 1-2 tuples to each of the partitions at a time, which brings us down to tuple-at-a-time processing 😓 not much you can do about this, though. If you're interested, you could try sorting the data by the column that you're partitioning on. This should greatly speed up the partitioning code since more tuples are written to a partition at a time.
Just one comment: Is it worth making a `.benchmark` file so that the CI checks that the performance does not degrade?
Great work, this will be a killer feature for DuckDB! Do you have plans to support writing to S3 as well in the future?
@lnkuiper thanks for the review, will add some benchmarks later today! @tobilg wow somehow I completely forgot to test this.. Thanks for the comment 😅 This feature should just work™ with S3, but it may need some additional tweaks. I'll add the tests later today. An important use case for the partitioning feature is certainly to be able to repartition a dataset from and to S3 in a fully streaming fashion.
Thanks, that's great news! Will there be a limitation regarding the number of partitions?
@tobilg There's for sure going to be a prohibitive overhead when partitioning with very large numbers of partitions, see also the comment by @lnkuiper and the benchmarks in this PR. However, with the optimizations we have planned (and perhaps the sorting optimization proposed by @lnkuiper), we should be able to support decently large numbers of partitions.
This is similar to the problem we encounter in hash tables - what we do there is divide the partitioning into two passes. First we figure out for each tuple where it should be written - then we scatter to multiple locations at once. Perhaps something similar can be done to solve this problem with partitioning?
That'll be a great and very useful feature then! Other common ways on AWS to repartition parquet data in S3 have pretty tight limits, e.g. Athena CTAS queries can only write to 100 partitions simultaneously. One can work around this by using multiple passes, or pre-partitioning the data, but the DX is not great.
@Mytherin This is a good idea. It is tricky to implement, however, because we are not scattering to a row layout but appending to an intermediate buffer, which is a
@lnkuiper @Mytherin I added two simple microbenchmarks which I've also added to the

@tobilg Added the S3 tests. There were also some minor issues that are now resolved. So S3 seems to work fine!
Thanks for the PR! Very exciting feature. Some comments from my side:
Thanks for the fixes! One minor comment, otherwise looks great:
Thanks for the updates - LGTM! Ready to merge after CI passes.
@Mytherin Thanks! Indeed everything looks good.
This PR introduces a first version of the partitioned COPY operator. It builds on previous work from @lnkuiper adding the `PartitionedColumnData` (#4970) and the per-thread output from @hannes (#5412).

To summarize, the Partitioned COPY:
The partitioned write is used similarly to the PER_THREAD_OUTPUT feature:
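A minimal sketch of the syntax (the table name, column names, and output path below are placeholders, not taken from this PR):

```sql
-- Write the table partitioned by the year and month columns;
-- each distinct (year, month) combination becomes its own directory.
COPY orders TO 'orders' (FORMAT PARQUET, PARTITION_BY (year, month));
```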
This command will write files in the following format, which is known as the Hive partitioning scheme:
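An illustrative layout for the example above (the exact file names within each partition directory may differ):

```
orders/year=2021/month=1/data_0.parquet
orders/year=2021/month=2/data_0.parquet
orders/year=2022/month=1/data_0.parquet
...
```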
Partitioned copy to S3 also works:
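For example (the bucket name is a placeholder, and S3 credentials are assumed to be configured for the httpfs extension):

```sql
-- Same syntax as the local case, but with an s3:// target path.
COPY orders TO 's3://my-bucket/orders' (FORMAT PARQUET, PARTITION_BY (year, month));
```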
Finally, a check is performed for existing files/directories, which is currently quite conservative (and on S3 will add a bit of latency). To disable this check and force writing, an ALLOW_OVERWRITE flag is added:
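A sketch of how the flag could be passed (the exact boolean form is assumed here):

```sql
-- Skip the existing-directory check and overwrite any files already present.
COPY orders TO 'orders' (FORMAT PARQUET, PARTITION_BY (year, month), ALLOW_OVERWRITE TRUE);
```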
Note that this also works with the PER_THREAD_OUTPUT feature.
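For example, combining the two options might look as follows (the boolean form of both flags is assumed):

```sql
-- One output file per thread, overwriting any existing output.
COPY orders TO 'orders' (FORMAT PARQUET, PER_THREAD_OUTPUT TRUE, ALLOW_OVERWRITE TRUE);
```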
Implementation
To support these features, a new class `HivePartitionedColumnData` is introduced, which implements the `PartitionedColumnData` interface from #4970. The main complexity here is that the `HivePartitionedColumnData` class needs to be able to discover new partitions in parallel; for the RadixPartitioning that was already implemented, the number of partitions is known in advance and does not change. Partition discovery is handled by all threads while they write tuples to their local `HivePartitionedColumnData`. To avoid expensive locking when synchronizing partition discovery, each thread keeps a local partition map for quick lookups during partitioning, backed by a shared global state where new partitions are registered. A lock is therefore only required when adding new partitions to the global state. Since this partitioned write is not expected to scale to very large numbers of partitions anyway, this should work well. Because of this shared state, the partition indices between the thread-local `HivePartitionedColumnData` objects remain in sync.

Benchmarks
Here are some rough benchmarks to give an indication of the performance overhead, measured on an M1 MacBook:
Note that performance for low numbers of partitions is very good. Performance at higher partition counts gets pretty bad, but this is most likely because the resulting files then contain only very few tuples, leading to large I/O overhead. I would expect the relative overhead to be lower for larger files, but more benchmarking is required here.
Transform partition columns before writing
Note that partition values are converted to strings. There are probably many edge cases where this won't work nicely. However, by using SQL in your COPY statement you can transform the PARTITION_BY columns however you want. For example, to add a prefix to an existing column called part_col, you could do:
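A sketch of such a transformation (my_table and the non-partition columns col_a/col_b are placeholders; part_col is the column from the example above):

```sql
-- Prefix the partition column before writing; the partition directories
-- will then be named part_col=prefix_<original value>.
COPY (
    SELECT col_a, col_b, 'prefix_' || part_col AS part_col
    FROM my_table
) TO 'my_output' (FORMAT PARQUET, PARTITION_BY (part_col));
```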
Limitations
The partitioned write fully materializes the data, so it will quickly fill the available memory for large tables. This is not necessarily a blocker, as the buffer manager can offload to disk. It does, however, mean that enough local disk space is required, and for large files the data is:
This is not ideal, but will still be good enough for many use cases.
Future work
The ideal partitioning COPY would produce 1 file per partition while providing mechanisms to limit memory usage and not require full materialization.
The first step towards this goal is to make a streaming variant that can flush partitions as partition tuple limits (or possibly global operator limits) are reached. This would produce a single file per partition, per flush of that partition.
The second step would be to not close files after flushing, allowing multiple flushes to a single file. With this we would achieve the desired behaviour where we can produce a single file per partition in a streaming fashion.
We have some nice ideas on implementing the above, so hopefully two more PRs are coming up soon-ish :)