Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

infer right side nullability for LEFT join #5748

Merged
merged 6 commits into from
Mar 29, 2023
Merged

Conversation

comphead
Copy link
Contributor

@comphead comphead commented Mar 27, 2023

Which issue does this PR close?

Closes partially #5747 (LEFT Join Mismatch between schema and batches on a CREATE TABLE with a windowing query #5695 (comment)).

Rationale for this change

Weakens the schema comparison between planner schema and batches leaving only name and datatype to avoid false positives on checking not null and nullable columns

What changes are included in this PR?

Are these changes tested?

yes

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Mar 27, 2023
@comphead comphead marked this pull request as ready for review March 27, 2023 05:18
@milevin
Copy link

milevin commented Mar 27, 2023

@comphead I am not sure about this change; it doesn't seem sound.

The analyzer is assigning incorrect nullability flags to the right side of the join. I wonder if we shouldn't fix this instead.

@comphead
Copy link
Contributor Author

I think its not that easy for analyzer to identify correct nullable flags without scanning entire data. for LEFT join it assumes right side cannot be matched and sets nullable and this sounds correct for me.

From other side we also have similar weaken check in optimizer https://github.com/apache/arrow-datafusion/blob/26b8377b0690916deacf401097d688699026b8fb/datafusion/optimizer/src/optimizer.rs#L423

@milevin
Copy link

milevin commented Mar 27, 2023

Consider the above query. After plan_from_tables is finished in select.rs, the assigned column properties are as follows:

        DFField {
            qualifier: Some(
                Bare {
                    table: "t1",
                },
            ),
            field: Field {
                name: "col1",
                data_type: Int64,
                nullable: false,
                dict_id: 0,
                dict_is_ordered: false,
                metadata: {},
            },
        },
        DFField {
            qualifier: Some(
                Bare {
                    table: "t1",
                },
            ),
            field: Field {
                name: "col2",
                data_type: Utf8,
                nullable: false,
                dict_id: 0,
                dict_is_ordered: false,
                metadata: {},
            },
        },
        DFField {
            qualifier: Some(
                Bare {
                    table: "t2",
                },
            ),
            field: Field {
                name: "col3",
                data_type: Int64,
                nullable: false,
                dict_id: 0,
                dict_is_ordered: false,
                metadata: {},
            },
        },
        DFField {
            qualifier: Some(
                Bare {
                    table: "t2",
                },
            ),
            field: Field {
                name: "col4",
                data_type: Utf8,
                nullable: false,
                dict_id: 0,
                dict_is_ordered: false,
                metadata: {},
            },
        },

The two t2 fields have incorrect nullability flag inferred. Unless I am missing something, the plan_from_tables method does not need to scan the data to infer the correct nullability flags. All it needs is the understanding of which kind of join it is processing and which side of that join the columns are coming from.

Does it make sense?

@comphead
Copy link
Contributor Author

comphead commented Mar 27, 2023

yes, good catch, col4 has to have a true nullability. That might happen because optimizer sees literals and assumes they cannot be nulls, which will be incorrect in case of left join for sure. Checking this

@comphead comphead marked this pull request as draft March 27, 2023 17:00
@milevin
Copy link

milevin commented Mar 27, 2023

This is before optimizer I believe. It's just the initial stage of building the logical plan from the query.

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Mar 28, 2023
@comphead comphead changed the title weaken schema check on mem table infer right side nullability for LEFT join Mar 28, 2023
@github-actions github-actions bot added the optimizer Optimizer rules label Mar 28, 2023
@milevin
Copy link

milevin commented Mar 28, 2023

This is great @comphead! Check out the comments above.

@comphead comphead marked this pull request as ready for review March 28, 2023 14:39
.cloned()
.collect()
}
JoinType::Left => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix other types of joins on separate PR

@comphead
Copy link
Contributor Author

cool, @milevin @alamb @jackwener please review whenever you have time

@milevin
Copy link

milevin commented Mar 28, 2023

cool, @milevin @alamb @jackwener please review whenever you have time

I'll approve, but can you address the question about equivalent_names_and_types still remaining in this PR?

JoinType::Inner | JoinType::Left | JoinType::Full | JoinType::Right => {
let right_fields = right.fields().iter();
let left_fields = left.fields().iter();
JoinType::Inner | JoinType::Full | JoinType::Right => {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right should be treated symmetrically opposite of Left.

Full should be nullified on both sides.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -530,6 +530,13 @@ FROM t1
----
11 a 55

# test create table from query with LEFT join
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to add tests like this for Right, Full, and all the other flavors of joins.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -234,6 +235,18 @@ pub(crate) fn parse_identifiers_normalized(s: &str) -> Vec<String> {
.collect::<Vec<_>>()
}

/// Check two schemas for being equal for field names/types
pub fn equivalent_names_and_types(schema: &SchemaRef, other: SchemaRef) -> bool {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you still need this change for equivalent_names_and_types? I don't think it's used in your solution to the problem; right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@comphead is this change relevant here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done, Ive got a better idea how to refactor this method.

@comphead
Copy link
Contributor Author

@alamb @jackwener please check this PR so I can follow up later on other join types

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes sense to me -- thank you @comphead

It is somewhat amazing that his wasn't caught by existing tests.

@alamb
Copy link
Contributor

alamb commented Mar 29, 2023

The analyzer is assigning incorrect nullability flags to the right side of the join. I wonder if we shouldn't fix this inst

I believe the story is that @comphead plans to fix this in a follow on PR. Is that accurate @comphead ?

Thank you for the review @milevin

@milevin
Copy link

milevin commented Mar 29, 2023

The analyzer is assigning incorrect nullability flags to the right side of the join. I wonder if we shouldn't fix this inst

I believe the story is that @comphead plans to fix this in a follow on PR. Is that accurate @comphead ?

@alamb This comment of mine was on an original version of this PR which tried to address the issue by weakening the compatibility check; @comphead followed the suggestion in the comment to rewrite the fix by changing the analyzer instead -- so all is good!

Note: for the case of LEFT join, the right side must be nullified (that's what the comment talks about). A follow on PR will handle RIGHT and FULL OUTER joins where the left side will have to be adjusted as well.

-- Michael

@comphead
Copy link
Contributor Author

The analyzer is assigning incorrect nullability flags to the right side of the join. I wonder if we shouldn't fix this inst

I believe the story is that @comphead plans to fix this in a follow on PR. Is that accurate @comphead ?

@alamb This comment of mine was on an original version of this PR which tried to address the issue by weakening the compatibility check; @comphead followed the suggestion in the comment to rewrite the fix by changing the analyzer instead -- so all is good!

Note: for the case of LEFT join, the right side must be nullified (that's what the comment talks about). A follow on PR will handle RIGHT and FULL OUTER joins where the left side will have to be adjusted as well.

-- Michael

Right, for RIGHT/FULL OUTER joins I will create a follow up PR. This one is too overloaded :)

@alamb
Copy link
Contributor

alamb commented Mar 29, 2023

Thanks all !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants