[FEAT] Streaming physical writes for native executor #2992
Conversation
CodSpeed Performance Report: Merging #2992 will not alter performance.
daft/io/writer.py
        raise NotImplementedError("Subclasses must implement this method.")

    def write(self, table: MicroPartition):
        if self.current_writer is None:
It would be better here to rely on overridable methods or properties, e.g. something like `if self.current_writer() is None:`. Ideally, you can centralize the logic here and have the child classes implement the specifics for each file type.
OK, I made the methods in FileWriterBase all abstract methods, so the child classes each have their own implementation.
let mut current_writer: Option<Box<dyn FileWriter>> = None;
let mut current_file_idx = None;
while let Some((data, file_idx)) = input_receiver.recv().await {
    if current_file_idx.is_none() || current_file_idx.unwrap() != file_idx {
Reviewed the general pattern of how we are approaching file writes and I think we can improve it with a better abstraction. We already have the `FileWriter` and `FileWriterFactory` traits, but imagine if we start layering them. At the base we can have a `ParquetFileWriter`, but instead of implementing the row group batching in the executor, you can implement a `RowBatcherWriter` that takes in a `FileWriterFactory` and gives you a new factory. Then you can pass that factory into a `TargetFileSizeWriter` that targets the desired file size for each file it writes out. Finally, that factory can be passed into a `PartitionedWriter` that partitions by value.
This pattern is pretty common for building adaptors for writers, and can be seen in the Iceberg SDK:
https://github.com/apache/iceberg/blob/2b55fef7cc2a249d864ac26d85a4923313d96a59/core/src/main/java/org/apache/iceberg/io/PartitionedWriter.java
The cool thing is that these FileWriters can be parameterized at runtime, so the executor can likely be very simple.
I think I get it. So `PartitionedWriter` is itself a `FileWriter`, which holds a `WriterFactory` that can generate a `TargetFileSizeWriter`, which is also a `FileWriter` and also holds a `WriterFactory` that can generate a `RowBatchWriter`, which is also a `FileWriter`, and so on.
Implemented this in the latest commit, as a blocking sink.
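For concreteness, here is a minimal sketch of that layering. The trait shapes are assumptions: only `FileWriter<Input = Arc<MicroPartition>, Result = Vec<Table>>` appears in this PR, so the method names (`write`, `close`, `create_writer`), the `WriterFactory` signature, and the stand-in `MicroPartition`/`Table` types are placeholders for whatever `daft-writer` actually defines.

```rust
use std::sync::Arc;

// Stand-ins so this sketch is self-contained; the real types live elsewhere in daft.
struct MicroPartition;
struct Table;

// Hypothetical trait shapes; the associated types mirror
// FileWriter<Input = Arc<MicroPartition>, Result = Vec<Table>> from the PR.
trait FileWriter {
    type Input;
    type Result;
    fn write(&mut self, data: Self::Input);
    fn close(&mut self) -> Self::Result;
}

type DynWriter = Box<dyn FileWriter<Input = Arc<MicroPartition>, Result = Vec<Table>>>;

trait WriterFactory {
    fn create_writer(&self, file_idx: usize) -> DynWriter;
}

// Base layer: a writer that actually writes parquet (details elided).
struct ParquetFileWriter;
impl FileWriter for ParquetFileWriter {
    type Input = Arc<MicroPartition>;
    type Result = Vec<Table>;
    fn write(&mut self, _data: Self::Input) { /* append a row group */ }
    fn close(&mut self) -> Self::Result {
        vec![] // would return a metadata Table for the written file
    }
}

struct ParquetWriterFactory;
impl WriterFactory for ParquetWriterFactory {
    fn create_writer(&self, _file_idx: usize) -> DynWriter {
        Box::new(ParquetFileWriter)
    }
}

// Adaptor layer: rotates to a new file from the inner factory once the current
// file is "full", so file sizing never leaks into the executor. A RowBatcherWriter
// or PartitionedWriter would wrap a factory in exactly the same way.
struct TargetFileSizeWriter {
    factory: Box<dyn WriterFactory>,
    current: DynWriter,
    rows_in_current_file: usize,
    target_rows_per_file: usize, // stand-in for a real byte-size estimate
    current_file_idx: usize,
    results: Vec<Table>,
}

impl FileWriter for TargetFileSizeWriter {
    type Input = Arc<MicroPartition>;
    type Result = Vec<Table>;

    fn write(&mut self, data: Self::Input) {
        if self.rows_in_current_file >= self.target_rows_per_file {
            // Close the full file and ask the inner factory for the next one.
            self.results.extend(self.current.close());
            self.current_file_idx += 1;
            self.current = self.factory.create_writer(self.current_file_idx);
            self.rows_in_current_file = 0;
        }
        self.rows_in_current_file += 1; // size estimation elided
        self.current.write(data);
    }

    fn close(&mut self) -> Self::Result {
        self.results.extend(self.current.close());
        std::mem::take(&mut self.results)
    }
}

fn main() {
    let factory: Box<dyn WriterFactory> = Box::new(ParquetWriterFactory);
    let mut writer = TargetFileSizeWriter {
        current: factory.create_writer(0),
        factory,
        rows_in_current_file: 0,
        target_rows_per_file: 1000,
        current_file_idx: 0,
        results: vec![],
    };
    writer.write(Arc::new(MicroPartition));
    let _metadata = writer.close();
}
```

The point is that each adaptor is itself a `FileWriter` that owns a factory for the layer below it, so row batching, file sizing, and partitioning compose without the executor knowing about any of them.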
        Table containing metadata about the written file, including path and partition values.
        """
        pass
We should also have a finalize method rather than overloading close to both start the next file and close the last file.
I was actually intending for these Python writers to be non-rotating, i.e. no writing after closing. They should be given a unique file_idx for the file_name generation upon construction, and a unique set of partition_values.
I will add assertions and some comments to document this behaviour.
        pass

    @abstractmethod
    def close(self) -> Table:
What if we name this something like start_next_file?
Mirroring the above comment: the Python file writers should not write after close.
@@ -1276,5 +1277,27 @@ impl Display for MicroPartition {
    }
}

impl Bufferable for MicroPartition {
Since no one besides the execution crate is going to use this, what I would recommend is to do this in daft-local-execution, i.e. impl this trait on MicroPartition there. This is called the newtype pattern.
At first I was intending for the targetbatchwriter to use the bufferable stuff as well, but I think it's simpler to have it implement its own buffering logic, and let local execution have its own buffering logic as well.
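For reference, a minimal sketch of the newtype pattern being suggested here; the `Bufferable` method shown (`num_rows`) is made up, and everything is collapsed into one file so the example is self-contained, whereas in the real codebase `MicroPartition` (and possibly `Bufferable`) come from other crates.

```rust
use std::sync::Arc;

// Stand-in for the real MicroPartition from daft-micropartition.
struct MicroPartition;

// Hypothetical shape of the Bufferable trait; the real definition may differ.
trait Bufferable {
    fn num_rows(&self) -> usize;
}

// Newtype defined inside daft-local-execution. Because the wrapper type is local
// to that crate, it can implement whatever trait it needs even when both the
// trait and MicroPartition are defined elsewhere (sidestepping the orphan rule),
// and the impl stays out of the shared MicroPartition crate.
struct BufferableMicroPartition(Arc<MicroPartition>);

impl Bufferable for BufferableMicroPartition {
    fn num_rows(&self) -> usize {
        // Delegate to the wrapped MicroPartition (real logic elided).
        let _inner: &MicroPartition = &self.0;
        0
    }
}

fn main() {
    let part = BufferableMicroPartition(Arc::new(MicroPartition));
    println!("rows = {}", part.num_rows());
}
```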
        }
    },
);
match entry {
Not a fan of using the raw pattern outside of the probe table. Not blocking this PR, but we should have some way to create Scalars so we can just put those into the hashmap.
Would storing the string representation of the partition values be a better alternative for now?
Hmmm but then we'd have to do string conversion every time, so probably not.
Let's make an issue for this and tackle this after we create ScalarValue
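To illustrate the intent (purely hypothetical, since `ScalarValue` does not exist yet), a partition key could eventually look something like this instead of the raw `IndexHash` pattern:

```rust
use std::collections::HashMap;

// Hypothetical ScalarValue sketch: a small enum wrapping the possible partition
// value types so it can be used directly as a HashMap key. The real type, once
// created, would likely cover more types and live in a core crate.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum ScalarValue {
    Null,
    Boolean(bool),
    Int64(i64),
    Utf8(String),
    // Floats would need ordered/bit-pattern wrappers before they can be Eq + Hash.
}

fn main() {
    // Partition keys become ordinary map keys; in PartitionedWriter the value
    // would be the per-partition writer rather than a row count.
    let mut rows_per_partition: HashMap<Vec<ScalarValue>, usize> = HashMap::new();
    let key = vec![ScalarValue::Utf8("US".to_string()), ScalarValue::Int64(2024)];
    *rows_per_partition.entry(key).or_insert(0) += 1;
    println!("{:?}", rows_per_partition);
}
```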
/// to a separate file. It uses a map to keep track of the writers for each partition.
struct PartitionedWriter {
    per_partition_writers:
        HashMap<IndexHash, Box<dyn FileWriter<Input = Arc<MicroPartition>, Result = Vec<Table>>>>,
Tbh, we should do something like a ScalarValue that wraps the possible partition value types for our non-performance-critical stuff like here.
Great work!
        }
    },
);
match entry {
Let's make an issue for this and tackle this after we create ScalarValue
Implements streaming Iceberg and Delta writes for swordfish. Most of the write scaffolding has already been implemented in #2992; this PR implements the Iceberg/Delta-specific functionality.

A quick TLDR on swordfish writes:
- All of the row group sizing, file sizing, and partitioning is now handled in the `daft-writer` crate.
- Only the actual writing + flushing is currently handled via the PyArrow Parquet and CSV writers. We intend to build our own native writers in the future.

Notes:
- Modified the Iceberg writes such that:
  1. The plan now stores just the spec id + partition cols. (We used to keep the whole PartitionSpec object in the plan but only use the id; maybe we planned on keeping it around for future work, not sure, please let me know.)
  2. I made the `add_missing_columns` stuff an explicit projection. It was a little cleaner this way instead of having swordfish implement `add_missing_columns` internally.
Streaming writes for swordfish (parquet + csv only). Iceberg and delta writes are here: #2966
Implement streaming writes as a blocking sink. Unpartitioned writes run with 1 worker, and partitioned writes run with NUM_CPUS workers. As a drive-by, made blocking sinks parallelizable.

Behaviour
- Unpartitioned: writes go to a `TargetFileSizeWriter`, which manages file sizes and row group sizes as data is streamed in.
- Partitioned: data is partitioned via a `Dispatcher` and sent to workers based on the hash. Each worker runs a `PartitionedWriter` that manages partitioning by value, file sizes, and row group sizes.

Benchmarks:
I made a new benchmark suite in `tests/benchmarks/test_streaming_writes.py`; it tests writes of TPC-H lineitem to parquet/csv with and without partition columns and with different file/row group sizes. The streaming executor performs much better when there are partition columns, as seen in the benchmark screenshot. Without partition columns it is about the same; when the target row group size / file size is decreased, it is slightly slower, likely because it does more slicing, but this needs more investigation. Memory usage is the same for both.

Memory test on read->write parquet TPC-H lineitem sf1:
Native: (memory profile screenshot)
Python: (memory profile screenshot)