Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Fix intersection checking when unioning schemas #3039

Merged
merged 3 commits into from
Oct 15, 2024

Conversation

desmondcheongzx
Copy link
Contributor

@desmondcheongzx desmondcheongzx commented Oct 14, 2024

In the definition of Schema::union, the error message suggests that we intended to throw errors when performing a union on two schemas with overlapping keys. However, the original implementation took the set difference of keys between one side of the union and itself, which would never throw an error.

This bug was not noticed because the python tests went through the python code path which would check for the intersection correctly. But if one uses the Rust API directly, then this property is not upheld.

We fix this bug by instead checking that the two sides of the union have distinct keys.

@github-actions github-actions bot added the bug Something isn't working label Oct 14, 2024
Copy link

codspeed-hq bot commented Oct 14, 2024

CodSpeed Performance Report

Merging #3039 will not alter performance

Comparing desmondcheongzx:fix-schema-union (f3ba7b1) with main (a3453d1)

Summary

✅ 17 untouched benchmarks

@kevinzwang
Copy link
Member

Looks like it's failing some parquet integration tests

@@ -870,7 +870,7 @@ pub fn read_csv_into_micropartition(
let unioned_schema = tables
.iter()
.map(|tbl| tbl.schema.clone())
.try_reduce(|s1, s2| s1.union(s2.as_ref()).map(Arc::new))?
.reduce(|s1, s2| Arc::new(s1.non_distinct_union(s2.as_ref())))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we have two tables that we are unioning that have common columns? Just want to make sure that this non-distinct union is the correct behavior.

Copy link
Contributor Author

@desmondcheongzx desmondcheongzx Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really only gets used in the MicroPartition API for reading multiple parquet files. E.g. daft.table.MicroPartition.read_parquet_bulk(["file1.parquet", "file2.parquet"])

Here both files can have the same columns.

The other MicroPartition APIs for read_parquet, read_csv, and read_json are non-concerns because they only ever take in one uri. But the read_{csv, json, parquet}_into_micropartition functions they call under the hood take in a slice of uris and can run into the same problem that read_parquet_bulk currently does. As of today there are no other users besides read_parquet_bulk that read more than one uri.

FWIW I believe the original authors (@jaychia and @clarkzinzow) intended to use the semantics of a non-distinct union. But I'm not 100% sure why we would bother with the cases where the schemas were mismatched---I imagine this would quickly blow up elsewhere.

src/daft-schema/src/schema.rs Outdated Show resolved Hide resolved
src/daft-schema/src/schema.rs Outdated Show resolved Hide resolved
@desmondcheongzx desmondcheongzx enabled auto-merge (squash) October 15, 2024 00:44
@desmondcheongzx desmondcheongzx merged commit c8871d0 into Eventual-Inc:main Oct 15, 2024
38 checks passed
@desmondcheongzx desmondcheongzx deleted the fix-schema-union branch October 15, 2024 01:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants