Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should not convert a normal non-inner join to Cross Join when there are non-equal Join conditions #4363

Closed
mingmwang opened this issue Nov 25, 2022 · 5 comments · Fixed by #4562
Assignees
Labels
bug Something isn't working

Comments

@mingmwang
Copy link
Contributor

Describe the bug
A clear and concise description of what the bug is.

CREATE TEMPORARY TABLE t1 (t1_id INT,t1_name String);
CREATE TEMPORARY TABLE t2 (t2_id INT,t2_name String);

insert into t1 values (11, "a"), (22, "b"), (33, "c"), (44, "d");
insert into t2 values (11, "z"), (22, "y"), (44, "x"), (55, "w");

SELECT t1_id, t1_name, t2_name FROM t1 LEFT JOIN t2 ON (t1_id != t2_id and t2_id >= 100) ORDER BY t1_id;

SparkSQL result:

        "+-------+---------+---------+",
        "| t1_id | t1_name | t2_name |",
        "+-------+---------+---------+",
        "| 11    | a       |         |",
        "| 22    | b       |         |",
        "| 33    | c       |         |",
        "| 44    | d       |         |",
        "+-------+---------+---------+",

DataFusion UT

async fn error_cross_join() {
    let test_repartition_joins = vec![true, false];
    for repartition_joins in test_repartition_joins {
        let ctx = create_join_context("t1_id", "t2_id", repartition_joins).unwrap();

        let sql = "SELECT t1_id, t1_name, t2_name FROM t1 LEFT JOIN t2 ON (t1_id != t2_id and t2_id >= 100) ORDER BY t1_id";
        let actual = execute_to_batches(&ctx, sql).await;
        let expected = vec![
            "+-------+---------+---------+",
            "| t1_id | t1_name | t2_name |",
            "+-------+---------+---------+",
            "| 11    | a       |         |",
            "| 22    | b       |         |",
            "| 33    | c       |         |",
            "| 44    | d       |         |",
            "+-------+---------+---------+",
        ];

        assert_batches_eq!(expected, &actual);
    }
}

actual:

[
    "++",
    "++",
]

To Reproduce
Steps to reproduce the behavior:

Expected behavior
A clear and concise description of what you expected to happen.

Additional context
Add any other context about the problem here.

@mingmwang mingmwang added the bug Something isn't working label Nov 25, 2022
@mingmwang
Copy link
Contributor Author

Except for Inner Join, we can not treat join conditions as filter conditions(there are some cases join conditions can be pushed down, depends on the join expr and join type). Need to introduce a new Physical Join implementation nested_loop_join to fix the issues.

@mingmwang mingmwang changed the title Should not covert a normal non-inner join to Cross Join when there are non-equal Join conditions Should not convert a normal non-inner join to Cross Join when there are non-equal Join conditions Nov 25, 2022
@alamb
Copy link
Contributor

alamb commented Nov 26, 2022

I think it is also possible to extend the HashJoin implementation with the new semantics rather than adding an entirely new physical join implementation

@liukun4515
Copy link
Contributor

I think it is also possible to extend the HashJoin implementation with the new semantics rather than adding an entirely new physical join implementation

hash join can just support equal condition. If there is no the equal condition, we can't use the hash join.

If the condition is on left_table.column1 > 10 which can't be used for hash.

@alamb
Copy link
Contributor

alamb commented Nov 27, 2022

hash join can just support equal condition. If there is no the equal condition, we can't use the hash join.

If the condition is on left_table.column1 > 10 which can't be used for hash.

I was thinking something like you could use a hash table with a single entry when there are no join keys -- which will effectively devolve into a CrossJoin

What I really would hope to avoid is another join implementation if possible -- Joins are already complicated enough, and so adding another one will not improve the situation.

@mingmwang
Copy link
Contributor Author

I would prefer to keep a different join implementation. A different physical plan like 'nested_loop_join' will deliver the clear information to users and developers that the plan is actually a nested loop join.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants