[BUG] Writes from empty partitions should return empty micropartitions with non-null schema #2952

Merged: colin-ho merged 3 commits into main from colin/fix-writes-with-empty-parts on Sep 30, 2024


colin-ho
Contributor

If a partition is empty, the write still returns a result with the file path and partition columns, but their data type is Null. This causes a schema mismatch with the results from partitions that did write data.

```
import daft

df = (
    daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})
    .into_partitions(4)
    .write_parquet("z", partition_cols=["bar"])
)
print(df)
```

```
daft.exceptions.DaftCoreException: DaftError::SchemaMismatch MicroPartition concat requires all schemas to match,
╭─────────────┬──────╮
│ Column Name ┆ Type │
╞═════════════╪══════╡
│ path        ┆ Utf8 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ bar         ┆ Utf8 │
╰─────────────┴──────╯
 vs
╭─────────────┬──────╮
│ Column Name ┆ Type │
╞═════════════╪══════╡
│ path        ┆ Null │
╰─────────────┴──────╯
```
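
The failure mode above can be illustrated with a small, library-free sketch. Everything in it is hypothetical and simplified: `WriteResult`, `result_with_inferred_schema`, `result_with_declared_schema`, and `concat` are toy stand-ins, not Daft classes or functions, and only the `path` column is modeled. The idea is that a schema inferred from an empty partition's data degrades to `Null`, while a schema declared up front stays the same for every partition.

```python
from dataclasses import dataclass, field


@dataclass
class WriteResult:
    """Toy stand-in for a per-partition write result (not a Daft class)."""
    schema: dict[str, str]  # column name -> type name
    rows: list[dict] = field(default_factory=list)


def result_with_inferred_schema(paths: list[str]) -> WriteResult:
    # Inferring the type from the data yields "Null" when there is no data.
    dtype = "Utf8" if paths else "Null"
    return WriteResult({"path": dtype}, [{"path": p} for p in paths])


def result_with_declared_schema(paths: list[str], schema: dict[str, str]) -> WriteResult:
    # The schema is fixed up front, so an empty partition still reports it.
    return WriteResult(dict(schema), [{"path": p} for p in paths])


def concat(results: list[WriteResult]) -> WriteResult:
    # Mimics a concat that requires every input to share one schema.
    first = results[0].schema
    for r in results[1:]:
        if r.schema != first:
            raise ValueError(f"schema mismatch: {first} vs {r.schema}")
    return WriteResult(dict(first), [row for r in results for row in r.rows])


# One partition wrote a file, the other was empty.
partitions = [["z/bar=a/0.parquet"], []]
declared = {"path": "Utf8"}
merged = concat([result_with_declared_schema(p, declared) for p in partitions])
```

With `result_with_inferred_schema` in place of `result_with_declared_schema`, the same `concat` call raises, because the empty partition reports `Null` for `path`.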

github-actions bot added the bug label (Something isn't working) on Sep 26, 2024

codspeed-hq bot commented Sep 26, 2024

CodSpeed Performance Report

Merging #2952 will not alter performance

Comparing colin/fix-writes-with-empty-parts (5b6873a) with main (f1194b5)

Summary

✅ 17 untouched benchmarks


codecov bot commented Sep 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.99%. Comparing base (28e72b2) to head (5b6873a).
Report is 11 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #2952      +/-   ##
==========================================
- Coverage   78.38%   77.99%   -0.39%
==========================================
  Files         596      598       +2
  Lines       69693    70742    +1049
==========================================
+ Hits        54627    55175     +548
- Misses      15066    15567     +501
```

Flag coverage Δ: 77.99% <100.00%> (-0.39%) ⬇️

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| daft/iceberg/iceberg_write.py | 76.47% <100.00%> (+1.03%) ⬆️ |
| daft/table/table_io.py | 85.96% <100.00%> (+0.25%) ⬆️ |

... and 55 files with indirect coverage changes

Member

@kevinzwang kevinzwang left a comment


Looks good! Thanks for fixing this

@colin-ho colin-ho merged commit f10d4da into main Sep 30, 2024
40 checks passed
@colin-ho colin-ho deleted the colin/fix-writes-with-empty-parts branch September 30, 2024 20:06
sagiahrac pushed a commit to sagiahrac/Daft that referenced this pull request Oct 7, 2024
…s with non-null schema (Eventual-Inc#2952)

Co-authored-by: Colin Ho <[email protected]>