[BUG] Fix intersection checking when unioning schemas #3039
Conversation
CodSpeed Performance Report: Merging #3039 will not alter performance.
Looks like it's failing some parquet integration tests
@@ -870,7 +870,7 @@ pub fn read_csv_into_micropartition(
     let unioned_schema = tables
         .iter()
         .map(|tbl| tbl.schema.clone())
-        .try_reduce(|s1, s2| s1.union(s2.as_ref()).map(Arc::new))?
+        .reduce(|s1, s2| Arc::new(s1.non_distinct_union(s2.as_ref())))
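For context, here is a minimal, self-contained sketch of the behavioral change in the diff above, using a toy Schema type rather than Daft's real one. The behaviors of union and non_distinct_union below are assumptions inferred from their names and from this diff, not Daft's actual implementation.

use std::collections::BTreeMap;
use std::sync::Arc;

// Toy stand-in for a schema: field name -> dtype name.
#[derive(Clone, Debug)]
struct Schema {
    fields: BTreeMap<String, String>,
}

impl Schema {
    // Strict union: errors if the two schemas share any field name.
    fn union(&self, other: &Self) -> Result<Self, String> {
        if self.fields.keys().any(|k| other.fields.contains_key(k)) {
            return Err("duplicate field names in union".to_string());
        }
        let mut fields = self.fields.clone();
        fields.extend(other.fields.clone());
        Ok(Self { fields })
    }

    // Non-distinct union: shared field names are kept once (self's entry wins).
    fn non_distinct_union(&self, other: &Self) -> Self {
        let mut fields = other.fields.clone();
        fields.extend(self.fields.clone());
        Self { fields }
    }
}

fn main() {
    // Two "file schemas" that share the column "id", as in a bulk parquet read.
    let s1 = Arc::new(Schema {
        fields: BTreeMap::from([("id".to_string(), "Int64".to_string())]),
    });
    let s2 = Arc::new(Schema {
        fields: BTreeMap::from([
            ("id".to_string(), "Int64".to_string()),
            ("x".to_string(), "Utf8".to_string()),
        ]),
    });

    // Old code path: a strict union fails because both schemas contain "id",
    // and try_reduce would have propagated that error out of the read.
    assert!(s1.union(&s2).is_err());

    // New code path: reduce with a non-distinct union merges the schemas,
    // keeping the shared "id" column once.
    let merged = [s1.clone(), s2.clone()]
        .into_iter()
        .reduce(|a, b| Arc::new(a.non_distinct_union(b.as_ref())))
        .unwrap();
    assert_eq!(merged.fields.len(), 2); // "id" and "x"
}

With the old try_reduce + union combination, two files sharing a column name would fail the read; with reduce + non_distinct_union, the shared column is simply kept once in the combined schema.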
When do we have two tables that we are unioning that have common columns? Just want to make sure that this non-distinct union is the correct behavior.
This really only gets used in the MicroPartition API for reading multiple parquet files, e.g. daft.table.MicroPartition.read_parquet_bulk(["file1.parquet", "file2.parquet"]). Here both files can have the same columns.
The other MicroPartition APIs, read_parquet, read_csv, and read_json, are non-concerns because they only ever take in one uri. But the read_{csv, json, parquet}_into_micropartition functions they call under the hood take in a slice of uris and can run into the same problem that read_parquet_bulk currently does. As of today there are no other users besides read_parquet_bulk that read more than one uri.
FWIW I believe the original authors (@jaychia and @clarkzinzow) intended to use the semantics of a non-distinct union. But I'm not 100% sure why we would bother with the cases where the schemas were mismatched; I imagine this would quickly blow up elsewhere.
In the definition of Schema::union, the error message suggests that we intended to throw errors when performing a union on two schemas with overlapping keys. However, the original implementation took the set difference of keys between one side of the union and itself, which would never throw an error.
This bug was not noticed because the Python tests went through the Python code path, which checks for the intersection correctly. But if one uses the Rust API directly, then this property is not upheld.
We fix this bug by instead checking that the two sides of the union have distinct keys.
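To make that concrete, here is a hedged sketch of the old check versus the new one, using plain HashSets of field names instead of Daft's Schema type; the function names are made up for illustration.

use std::collections::HashSet;

fn union_keys_buggy(left: &HashSet<String>, right: &HashSet<String>) -> Result<HashSet<String>, String> {
    // Buggy check: the set difference of `left` with itself is always empty,
    // so overlapping field names are never detected.
    let overlap: Vec<_> = left.difference(left).collect();
    if !overlap.is_empty() {
        return Err(format!("duplicate field names: {overlap:?}"));
    }
    Ok(left.union(right).cloned().collect())
}

fn union_keys_fixed(left: &HashSet<String>, right: &HashSet<String>) -> Result<HashSet<String>, String> {
    // Fixed check: the two sides of the union must have distinct keys,
    // i.e. their intersection must be empty.
    let overlap: Vec<_> = left.intersection(right).collect();
    if !overlap.is_empty() {
        return Err(format!("duplicate field names: {overlap:?}"));
    }
    Ok(left.union(right).cloned().collect())
}

fn main() {
    let left: HashSet<String> = ["id", "a"].iter().map(|s| s.to_string()).collect();
    let right: HashSet<String> = ["id", "b"].iter().map(|s| s.to_string()).collect();

    // The buggy version silently accepts the overlapping "id" column...
    assert!(union_keys_buggy(&left, &right).is_ok());
    // ...while the fixed version reports it as an error.
    assert!(union_keys_fixed(&left, &right).is_err());
}

Checking the intersection of the two key sets directly is what the error message always promised, so the fix restores the intended strict-union semantics.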