Bug in HashJoin output_partition causes the left input to not be fully executed #5738
Comments
I will take a look.
@duongcongtoai
Thank you. Should we have a small validation in new function to notify users about this constraint?
@duongcongtoai Agree. We should add a validation to enforce this constraint.
Okay, let me open my first PR :D
Sure, I think you can add the input partition count check at execution time (not plan time) in the
@mingmwang is there a specific reason to validate at execution time instead of plan time? If the constraint is violated, validating at plan time would avoid starting unnecessary execution in partitioned tasks, right?
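For illustration, the validation discussed in this thread could look roughly like the sketch below. It assumes the constraint in question is that both inputs of a partitioned hash join must have the same number of partitions. The function name `check_partition_counts` and the `String` error type are hypothetical and are not part of DataFusion's API; a real check would take the counts from the child plans' output partitioning and return the project's own error type.

```rust
/// Hypothetical check (not DataFusion's API): in a partitioned hash join,
/// partition `i` of the left input is joined with partition `i` of the right
/// input, so this sketch assumes both counts must be equal.
fn check_partition_counts(left: usize, right: usize) -> Result<(), String> {
    if left != right {
        return Err(format!(
            "partitioned hash join requires equal input partition counts, \
             got left={left}, right={right}"
        ));
    }
    Ok(())
}
```

The same function could be called either when the plan is built or when a partition starts executing; calling it at plan time reports the mismatch before any partitioned task is started.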
Describe the bug
The HashJoinExec decides output_partition based on this function: https://github.com/apache/arrow-datafusion/blob/b7a33317c2abf265f4ab6b3fe636f87c4d01334c/datafusion/core/src/physical_plan/joins/utils.rs#L90
If PartitionMode is set to Partitioned and join_type is RIGHT, output_partition depends on the output_partition of the right child. If the left child has more partitions than the right child, some of the left child's partitions are never executed: https://github.com/apache/arrow-datafusion/blob/e87754cfe3afa4c358a8ca9c21c3c4acd020dfe5/datafusion/core/src/physical_plan/joins/hash_join.rs#L413
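The effect can be shown with a small standalone model. This is a simplified sketch, not the DataFusion code: the `JoinType` enum here is a reduced stand-in, and both function names are hypothetical. It assumes that a RIGHT join takes its output partition count from the right child, that the other modeled join types take it from the left child, and that the join is executed once per output partition index.

```rust
/// Reduced stand-in for the join types relevant to this issue.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum JoinType {
    Inner,
    Left,
    Right,
}

/// Toy model of how the output partition count is chosen.
/// Assumption: RIGHT follows the right child, the others follow the left child.
fn output_partition_count(join_type: JoinType, left: usize, right: usize) -> usize {
    match join_type {
        JoinType::Right => right,
        JoinType::Inner | JoinType::Left => left,
    }
}

/// Indices of left partitions that are never executed when the join is
/// driven once per output partition index `0..output_count`.
fn unexecuted_left_partitions(join_type: JoinType, left: usize, right: usize) -> Vec<usize> {
    let output_count = output_partition_count(join_type, left, right);
    // The range is empty when `output_count >= left`.
    (output_count..left).collect()
}
```

With 4 left partitions and 2 right partitions, a RIGHT join in this model never touches left partitions 2 and 3, so any right-side rows whose matches live in those partitions are joined incorrectly.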
To Reproduce
Code in this gist
Create two ExecutionPlan inputs from CSV files with a single field, "id", and build a HashJoinExec from them. Because some partitions of the left input are never executed, their rows are never probed against the matching rows in the right input, which produces an incorrect join result:
Expected behavior
HashJoin executes correctly
Additional context
No response