[BUG] Fix intersection checking when unioning schemas #3039

desmondcheongzx · 2024-10-14T22:22:03Z

In the definition of Schema::union, the error message suggests that we intended to throw errors when performing a union on two schemas with overlapping keys. However, the original implementation took the set difference of keys between one side of the union and itself, which would never throw an error.

This bug was not noticed because the Python tests went through the Python code path, which checks for the intersection correctly. But if one uses the Rust API directly, this property is not upheld.

We fix this bug by instead checking that the two sides of the union have distinct keys.
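For illustration, a minimal sketch of the difference between the two checks. This is not the actual `Schema::union` code: the function names and the use of plain string slices for keys are assumptions made to keep the example self-contained.

```rust
use std::collections::HashSet;

/// Hypothetical reconstruction of the buggy check: the key set is compared
/// against itself, so the set difference is always empty and an overlap is
/// never reported, whatever `_right` contains.
fn buggy_has_overlap(left: &[&str], _right: &[&str]) -> bool {
    let left_keys: HashSet<&str> = left.iter().copied().collect();
    left_keys.difference(&left_keys).next().is_some()
}

/// Sketch of the intended check: the two sides overlap exactly when
/// their key sets are not disjoint.
fn has_overlap(left: &[&str], right: &[&str]) -> bool {
    let left_keys: HashSet<&str> = left.iter().copied().collect();
    let right_keys: HashSet<&str> = right.iter().copied().collect();
    !left_keys.is_disjoint(&right_keys)
}
```

With keys `["a", "b"]` and `["b", "c"]`, `buggy_has_overlap` returns `false` even though `b` appears on both sides, while `has_overlap` returns `true`.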

codspeed-hq · 2024-10-14T22:35:16Z

CodSpeed Performance Report

Merging #3039 will not alter performance

Comparing desmondcheongzx:fix-schema-union (f3ba7b1) with main (a3453d1)

Summary

✅ 17 untouched benchmarks

kevinzwang · 2024-10-14T22:41:31Z

Looks like it's failing some parquet integration tests

kevinzwang · 2024-10-14T23:53:04Z

src/daft-micropartition/src/micropartition.rs

@@ -870,7 +870,7 @@ pub fn read_csv_into_micropartition(
            let unioned_schema = tables
                .iter()
                .map(|tbl| tbl.schema.clone())
-                .try_reduce(|s1, s2| s1.union(s2.as_ref()).map(Arc::new))?
+                .reduce(|s1, s2| Arc::new(s1.non_distinct_union(s2.as_ref())))


When do we have two tables that we are unioning that have common columns? Just want to make sure that this non-distinct union is the correct behavior.

This really only gets used in the MicroPartition API for reading multiple parquet files. E.g. daft.table.MicroPartition.read_parquet_bulk(["file1.parquet", "file2.parquet"])

Here both files can have the same columns.

The other MicroPartition APIs for read_parquet, read_csv, and read_json are non-concerns because they only ever take in one uri. But the read_{csv, json, parquet}_into_micropartition functions they call under the hood take in a slice of uris and can run into the same problem that read_parquet_bulk currently does. As of today there are no other users besides read_parquet_bulk that read more than one uri.

FWIW I believe the original authors (@jaychia and @clarkzinzow) intended to use the semantics of a non-distinct union. But I'm not 100% sure why we would bother with the cases where the schemas were mismatched. I imagine this would quickly blow up elsewhere.
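To make the two semantics concrete, here is a rough sketch using a stand-in type rather than the real `Schema`. The `Fields` alias, the function names `strict_union` and `union_all`, and the choice to keep the left side's entry on a name collision are all assumptions for illustration; the actual `non_distinct_union` may resolve collisions differently.

```rust
/// Hypothetical stand-in for a schema: ordered (column name, dtype) pairs.
type Fields = Vec<(String, String)>;

/// Strict union: returns an error if any column name appears on both sides.
fn strict_union(left: &Fields, right: &Fields) -> Result<Fields, String> {
    let clash = right
        .iter()
        .find(|(name, _)| left.iter().any(|(l, _)| l == name));
    if let Some((name, _)) = clash {
        return Err(format!("overlapping key: {}", name));
    }
    let mut out = left.clone();
    out.extend(right.iter().cloned());
    Ok(out)
}

/// Non-distinct union: shared names are allowed. This sketch keeps the
/// left side's entry and appends only the names that are new (assumption).
fn non_distinct_union(left: &Fields, right: &Fields) -> Fields {
    let mut out = left.clone();
    for (name, dtype) in right {
        if !out.iter().any(|(n, _)| n == name) {
            out.push((name.clone(), dtype.clone()));
        }
    }
    out
}

/// Folds the schemas of several files into one, mirroring the
/// `reduce` call in the diff. Returns `None` for an empty input.
fn union_all(schemas: Vec<Fields>) -> Option<Fields> {
    schemas
        .into_iter()
        .reduce(|s1, s2| non_distinct_union(&s1, &s2))
}
```

Under these assumptions, two files that both contain a column `x` make `strict_union` fail, while `union_all` produces a single schema containing `x` once.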

src/daft-schema/src/schema.rs

Fix union

8aed3e5

desmondcheongzx requested a review from kevinzwang October 14, 2024 22:22

github-actions bot added the bug Something isn't working label Oct 14, 2024

Fix integration tests

c670812

kevinzwang approved these changes Oct 15, 2024

Address review comments

f3ba7b1

desmondcheongzx enabled auto-merge (squash) October 15, 2024 00:44

desmondcheongzx merged commit c8871d0 into Eventual-Inc:main Oct 15, 2024
38 checks passed

desmondcheongzx deleted the fix-schema-union branch October 15, 2024 01:04
