
[FEAT] iceberg writes unpartitioned #2016

Merged: 15 commits into main on Mar 20, 2024
Conversation

samster25 (Member)

No description provided.

github-actions bot added the enhancement (New feature or request) label on Mar 19, 2024
# We perform the merge here since IcebergTable is not pickle-able
merge = _MergingSnapshotProducer(operation=operation, table=table)

builder = self._builder.write_iceberg(table)
samster25 (Member Author)

@Fokko can you take a look to see if this is fine?

Since we can't pickle the Iceberg Table and we still want distributed writes, we're having daft write out the parquet files and then return pyiceberg DataFiles in the distributed dataframe. Then we commit the data files with _MergingSnapshotProducer.
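A minimal sketch of that pattern, with hypothetical helper names (this is not Daft's actual code path, just the shape of the idea):

import os
import pyarrow as pa
import pyarrow.parquet as pq

def write_task(arrow_table: pa.Table, path: str) -> dict:
    # Runs on each worker: write the data out as parquet and return plain,
    # picklable metadata instead of shipping the IcebergTable around.
    pq.write_table(arrow_table, path)
    meta = pq.read_metadata(path)
    return {"path": path, "record_count": meta.num_rows, "size_bytes": os.path.getsize(path)}

# Back on the driver, the collected metadata is turned into pyiceberg
# DataFile objects and committed once, e.g. via _MergingSnapshotProducer.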

Fokko (Contributor)

Hey @samster25

> Since we can't pickle the Iceberg Table and we still want distributed writes, we're having daft write out the parquet files and then return pyiceberg DataFiles in the distributed dataframe.

That sounds like a good approach!

> Then we commit the data files with _MergingSnapshotProducer.

Typically you want to use a higher-level API, for example the transaction API: https://github.com/apache/iceberg-python/blob/a7f207f7e5831b3be02bd023c4b33babc3ea13f6/pyiceberg/table/__init__.py#L1153-L1161
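For illustration, a rough sketch of that higher-level path with a recent pyiceberg release (the catalog and table names are made up, and the exact API surface depends on the pyiceberg version):

from pyiceberg.catalog import load_catalog
import pyarrow as pa

catalog = load_catalog("default")
table = catalog.load_table("db.points")

df = pa.table({"lat": [52.37], "long": [4.89]})

# Let pyiceberg build and commit the snapshot instead of driving
# _MergingSnapshotProducer directly.
table.append(df)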


return pa.table(columns, schema=input_schema)


def write_iceberg(
samster25 (Member Author)

@Fokko This is where we are writing out the Iceberg files. We should be correctly writing out the field_id via schema_to_pyarrow.
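For context, this is roughly how the field ids end up in the written files (a sketch, not the exact Daft code; `iceberg_table` and `arrow_data` are placeholders):

import pyarrow.parquet as pq
from pyiceberg.io.pyarrow import schema_to_pyarrow

# Convert the Iceberg schema to an Arrow schema whose fields carry
# PARQUET:field_id metadata; parquet written with that schema embeds the
# field ids that Iceberg readers rely on.
file_schema = schema_to_pyarrow(iceberg_table.schema())

# `arrow_data` is assumed to already match the target field names/order.
pq.write_table(arrow_data.cast(file_schema), "data-file-0.parquet")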

Fokko (Contributor)

That's the most important part I was looking for. You can easily check this using parq:

parq 00000-0-39ec4caa-4d45-46b9-a6e9-c35cfb4e9290-0.parquet --schema

 # Schema 
 <pyarrow._parquet.ParquetSchema object at 0x131ca1c80>
required group field_id=1 schema {
  optional double field_id=2 lat;
  optional double field_id=3 long;
}
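The same check can be done without parq, using pyarrow directly (printing the ParquetSchema shows the field_id of each column, as in the output above):

import pyarrow.parquet as pq

print(pq.ParquetFile("00000-0-39ec4caa-4d45-46b9-a6e9-c35cfb4e9290-0.parquet").schema)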

else:
    raise ValueError(f"Only support `append` or `overwrite` mode. {mode} is unsupported")

# We perform the merge here since IcebergTable is not pickle-able

@@ -527,3 +520,185 @@ def file_visitor(written_file, i=i):
for c_name in partition_values.column_names():
    data_dict[c_name] = partition_values.get_column(c_name).take(partition_idx_series)
return MicroPartition.from_pydict(data_dict)


def coerce_pyarrow_table_to_schema(pa_table: pa.Table, input_schema: pa.Schema) -> pa.Table:
Fokko (Contributor)

I find this very frustrating about Arrow: you cannot just cast the table. I've already done quite a bit of work on this on the Arrow side, but it is still not complete :(
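In the meantime, a rough sketch of what such a coercion can look like when rebuilt column by column (an illustration under the signature quoted above, not necessarily Daft's implementation):

import pyarrow as pa

def coerce_pyarrow_table_to_schema(pa_table: pa.Table, input_schema: pa.Schema) -> pa.Table:
    # Cast each column individually and rebuild the table so the result
    # carries the target schema's field order, types, and field metadata
    # (e.g. PARQUET:field_id), which a plain Table.cast does not always handle.
    columns = [pa_table.column(field.name).cast(field.type) for field in input_schema]
    return pa.table(columns, schema=input_schema)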

Comment on lines +626 to +628
# TODO: these should be populated by `properties` but pyiceberg doesn't support them yet
target_file_size = 512 * 1024 * 1024
TARGET_ROW_GROUP_SIZE = 128 * 1024 * 1024
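As a side note, pyarrow's parquet writer expresses row-group size in rows rather than bytes, so a byte target like the one above has to be translated; a sketch of one way to do that (not necessarily how Daft maps it):

import pyarrow as pa
import pyarrow.parquet as pq

TARGET_ROW_GROUP_SIZE = 128 * 1024 * 1024  # bytes

def write_with_row_group_target(table: pa.Table, path: str) -> None:
    # Estimate bytes per row from the in-memory table and convert the
    # byte target into pyarrow's rows-based row_group_size parameter.
    bytes_per_row = max(table.nbytes // max(table.num_rows, 1), 1)
    rows_per_group = max(TARGET_ROW_GROUP_SIZE // bytes_per_row, 1)
    pq.write_table(table, path, row_group_size=rows_per_group)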

codecov bot commented Mar 20, 2024

Codecov Report

Attention: Patch coverage is 23.87097%, with 118 lines in your changes missing coverage. Please review.

Project coverage is 81.31%. Comparing base (d2f28d6) to head (0a620a4).
Report is 1 commit behind head on main.

❗ Current head 0a620a4 differs from pull request most recent head b3cd0fc. Consider uploading reports for the commit b3cd0fc to get more accurate results

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #2016      +/-   ##
==========================================
- Coverage   82.71%   81.31%   -1.40%     
==========================================
  Files          62       62              
  Lines        6623     6766     +143     
==========================================
+ Hits         5478     5502      +24     
- Misses       1145     1264     +119     
Files                                       Coverage Δ
daft/execution/physical_plan.py             94.69% <50.00%> (-0.49%) ⬇️
daft/execution/rust_physical_plan_shim.py   94.44% <50.00%> (-4.05%) ⬇️
daft/execution/execution_step.py            91.36% <59.09%> (-2.20%) ⬇️
daft/logical/builder.py                     83.19% <8.33%> (-8.40%) ⬇️
daft/dataframe/dataframe.py                 82.70% <4.34%> (-6.11%) ⬇️
daft/table/table_io.py                      75.17% <23.80%> (-14.78%) ⬇️

@samster25 samster25 enabled auto-merge (squash) March 20, 2024 07:17
@samster25 samster25 merged commit c2db062 into main Mar 20, 2024
29 checks passed
@samster25 samster25 deleted the sammy/iceberg-writes-unpartitioned branch March 20, 2024 07:36