[FEAT] iceberg writes unpartitioned #2016
Conversation
# We perform the merge here since IcebergTable is not pickle-able
merge = _MergingSnapshotProducer(operation=operation, table=table)

builder = self._builder.write_iceberg(table)
@Fokko can you take a look to see if this is fine?
Since we can't pickle the Iceberg Table and we still want distributed writes, we're having daft write out the parquet files and then return pyiceberg DataFiles in the distributed dataframe. Then we commit the data files with _MergingSnapshotProducer.
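For context, a minimal sketch of that commit path; append_data_file/commit are assumed from the pyiceberg 0.6.0 internals and the surrounding names are purely illustrative:

from pyiceberg.table import _MergingSnapshotProducer
from pyiceberg.table.snapshots import Operation

def commit_data_files(table, data_files):
    # Sketch only: the snapshot is produced on the driver, since the Iceberg
    # Table itself cannot be pickled and shipped to the workers.
    merge = _MergingSnapshotProducer(operation=Operation.APPEND, table=table)
    for data_file in data_files:  # pyiceberg DataFile objects returned by the distributed write
        merge.append_data_file(data_file)
    merge.commit()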
Hey @samster25
Since we can't pickle the Iceberg Table and we still want distributed writes, we're having daft write out the parquet files and then return pyiceberg DataFiles in the distributed dataframe.
That sounds like a good approach!
Then we commit the data files with _MergingSnapshotProducer.
Typically you want to use a higher level API. For example the transaction API: https://github.com/apache/iceberg-python/blob/a7f207f7e5831b3be02bd023c4b33babc3ea13f6/pyiceberg/table/__init__.py#L1153-L1161
Great!
Typically you want to use a higher level API. For example the transaction API: https://github.com/apache/iceberg-python/blob/a7f207f7e5831b3be02bd023c4b33babc3ea13f6/pyiceberg/table/__init__.py#L1153-L1161
That would be much cleaner but unfortunately it's not in the 0.6.0 release!
https://github.com/apache/iceberg-python/blob/cc449266e7fe0e97f23e61b3c732b75a0d0a8dec/pyiceberg/table/__init__.py#L185
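For completeness, a hedged sketch of how committing pre-written DataFiles might look once that transaction API is available; the update_snapshot/fast_append/append_data_file names are assumptions about a later pyiceberg release, not part of 0.6.0:

with table.transaction() as txn:
    # Hypothetical: attach the DataFiles produced by the distributed write and
    # let the transaction commit the new snapshot on exit.
    with txn.update_snapshot().fast_append() as fast_append:
        for data_file in data_files:
            fast_append.append_data_file(data_file)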
    return pa.table(columns, schema=input_schema)


def write_iceberg(
@Fokko This is where we are writing out the Iceberg files. We should be correctly writing out the field_id via schema_to_pyarrow.
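Roughly, schema_to_pyarrow embeds each field's Iceberg field ID as metadata on the Arrow schema, so writing with that schema carries the IDs into the Parquet footer. A hedged sketch (table and path names are illustrative; coerce_pyarrow_table_to_schema is the helper added in this PR):

import pyarrow.parquet as pq
from pyiceberg.io.pyarrow import schema_to_pyarrow

# Build an Arrow schema whose fields carry the b"PARQUET:field_id" metadata,
# coerce the data to it, and write so the field IDs end up in the Parquet file.
file_schema = schema_to_pyarrow(iceberg_table.schema())
arrow_table = coerce_pyarrow_table_to_schema(arrow_table, file_schema)
pq.write_table(arrow_table, "data/00000-0-example.parquet")  # illustrative path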
That's the most important part I was looking for. You can easily check this using parq:
parq 00000-0-39ec4caa-4d45-46b9-a6e9-c35cfb4e9290-0.parquet --schema
# Schema
<pyarrow._parquet.ParquetSchema object at 0x131ca1c80>
required group field_id=1 schema {
optional double field_id=2 lat;
optional double field_id=3 long;
}
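The same check works directly from Python, since the --schema output above is just the repr of pyarrow's ParquetSchema (file name taken from the output above):

import pyarrow.parquet as pq

# The ParquetSchema repr includes the field_id annotations, just like `parq --schema`.
print(pq.ParquetFile("00000-0-39ec4caa-4d45-46b9-a6e9-c35cfb4e9290-0.parquet").schema)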
else:
    raise ValueError(f"Only support `append` or `overwrite` mode. {mode} is unsupported")

# We perform the merge here since IcebergTable is not pickle-able
In the works: apache/iceberg-python#513
@@ -527,3 +520,185 @@ def file_visitor(written_file, i=i):
    for c_name in partition_values.column_names():
        data_dict[c_name] = partition_values.get_column(c_name).take(partition_idx_series)
    return MicroPartition.from_pydict(data_dict)


def coerce_pyarrow_table_to_schema(pa_table: pa.Table, input_schema: pa.Schema) -> pa.Table:
I find this very frustrating about Arrow, that you cannot just cast the table. I've already done quite a bit of work on this on the Arrow side, but it's still not complete :(
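For context, the shape such a helper takes is roughly the following sketch (a hedged illustration, not necessarily the PR's exact implementation): because Table.cast expects the field names and order to already line up, each column is cast individually against the target schema:

import pyarrow as pa

def coerce_pyarrow_table_to_schema(pa_table: pa.Table, input_schema: pa.Schema) -> pa.Table:
    # Cast column by column against the target schema rather than relying on
    # Table.cast, which is strict about field order and metadata.
    columns = [pa_table.column(field.name).cast(field.type) for field in input_schema]
    return pa.table(columns, schema=input_schema)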
# TODO: these should be populated by `properties` but pyiceberg doesn't support them yet
target_file_size = 512 * 1024 * 1024
TARGET_ROW_GROUP_SIZE = 128 * 1024 * 1024
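Since pyarrow's Parquet writer takes a row-group size in rows rather than bytes, a byte target like this presumably has to be converted using the in-memory size of the Arrow table; a hedged sketch of that conversion (function and variable names are illustrative):

import pyarrow.parquet as pq

def estimate_row_group_rows(arrow_table, target_row_group_bytes=128 * 1024 * 1024):
    # Rough conversion from a byte budget to a row count, using the table's
    # average in-memory bytes per row as a stand-in for its on-disk size.
    bytes_per_row = max(1, arrow_table.nbytes // max(1, arrow_table.num_rows))
    return max(1, target_row_group_bytes // bytes_per_row)

# pq.write_table's row_group_size is expressed in rows, not bytes:
# pq.write_table(arrow_table, "out.parquet", row_group_size=estimate_row_group_rows(arrow_table))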
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
##             main    #2016      +/-   ##
==========================================
- Coverage   82.71%   81.31%   -1.40%
==========================================
  Files          62       62
  Lines        6623     6766     +143
==========================================
+ Hits         5478     5502      +24
- Misses       1145     1264     +119
No description provided.