Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ARROW-12441] [DataFusion] Cross join implementation #11

Merged
merged 14 commits into from
Apr 22, 2021

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Apr 19, 2021

This is a first (naive, but probably not that bad) implementation of the cross join and CROSS JOIN syntax.

https://issues.apache.org/jira/browse/ARROW-12441

The left side gets loaded into memory and the right side is streamed and gets combined with the left side.

To keep memory usage down, we keep a "cursor" of values on the left side and producing the batches one by one instead of concatenating the result of the full cartesian product.

FYI @andygrove @alamb @jorgecarleitao

This also makes query 9 run in DataFusion (though performance is not OK, but I believe that should be not related to the cross join itself, but is caused by another issue).

@codecov-commenter
Copy link

codecov-commenter commented Apr 19, 2021

Codecov Report

Merging #11 (734b172) into master (c365a4f) will increase coverage by 0.12%.
The diff coverage is 81.33%.

❗ Current head 734b172 differs from pull request most recent head 844cb4b. Consider uploading reports for the commit 844cb4b to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master      #11      +/-   ##
==========================================
+ Coverage   70.41%   70.53%   +0.12%     
==========================================
  Files         123      124       +1     
  Lines       21261    21453     +192     
==========================================
+ Hits        14970    15132     +162     
- Misses       6291     6321      +30     
Impacted Files Coverage Δ
...lista/rust/core/src/serde/logical_plan/to_proto.rs 0.00% <ø> (ø)
datafusion/src/optimizer/constant_folding.rs 91.95% <0.00%> (-0.36%) ⬇️
datafusion/src/optimizer/projection_push_down.rs 98.66% <ø> (ø)
datafusion/src/physical_plan/hash_utils.rs 100.00% <ø> (+2.89%) ⬆️
datafusion/src/physical_plan/mod.rs 87.09% <ø> (ø)
datafusion/src/logical_plan/plan.rs 79.83% <61.53%> (-0.56%) ⬇️
datafusion/src/physical_plan/cross_join.rs 76.56% <76.56%> (ø)
datafusion/src/optimizer/hash_build_probe_order.rs 62.60% <88.88%> (+6.93%) ⬆️
benchmarks/src/bin/tpch.rs 38.86% <100.00%> (+0.52%) ⬆️
datafusion/src/logical_plan/builder.rs 89.18% <100.00%> (+0.30%) ⬆️
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c365a4f...844cb4b. Read the comment docs.

@jorgecarleitao
Copy link
Member

Amazing! :D

Since this will land on the next release, I suggest that we add / migrate the corresponding JIRA issue here, so that we can map issues <> features, bugs, etc. to create a change log.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is looking very cool -- thanks @Dandandan

I think the input handling of the cross join may be reversed from the intention

datafusion/src/physical_plan/cross_join.rs Outdated Show resolved Hide resolved
datafusion/src/physical_plan/cross_join.rs Outdated Show resolved Hide resolved
datafusion/src/physical_plan/cross_join.rs Show resolved Hide resolved
datafusion/src/physical_plan/cross_join.rs Show resolved Hide resolved
Comment on lines +50 to +53
/// left (build) side which gets loaded in memory
left: Arc<dyn ExecutionPlan>,
/// right (probe) side which are combined with left side
right: Arc<dyn ExecutionPlan>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reading of the code below is that it is the right side that is loaded into memory and then the left is streamed through.

I think the changes to the planner i this PR are trying to send the smaller input to the left hand input, I think it might make sense to change the implementation to buffer the left side rather than the right side

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relevant code:

                    let merge = MergeExec::new(self.left.clone());

left side should be loaded into memory (is the intention, but I am pretty sure we are doing this?), and we try to make the smallest side the left side (as far as possible).

datafusion/tests/sql.rs Outdated Show resolved Hide resolved
@jorgecarleitao jorgecarleitao added the datafusion Changes in the datafusion crate label Apr 20, 2021
@alamb
Copy link
Contributor

alamb commented Apr 20, 2021

I rewrote master to remove the history of the other languages (and 50MB of history). This means this PR will need to be "rebased" against the current master.

The best way I found to do this was to find the relevant commits (via git log) and then cherry-pick them one at a time. For example (change COMMIT-SHA to whatever the commit actually is):

git reset --hard apache/master
git cherry-pick COMMIT-SHA
git push -f your_rep

@alamb
Copy link
Contributor

alamb commented Apr 20, 2021

@Dandandan I am happy to rebase this PR against new master. Let me know if that would be helpful

@Dandandan
Copy link
Contributor Author

Looks like this is updated too

@Dandandan
Copy link
Contributor Author

@alamb I updated this and replied with some comments on the locking for the reference to the record batch

@alamb alamb changed the title [DataFusion] Cross join implementation [ARROW-12441] [DataFusion] Cross join implementation Apr 22, 2021
@alamb alamb merged commit 57eeb64 into apache:master Apr 22, 2021
@jorgecarleitao jorgecarleitao added the enhancement New feature or request label Apr 22, 2021
yjshen referenced this pull request in yjshen/datafusion Dec 21, 2021
andygrove added a commit that referenced this pull request Jan 12, 2023
* Initial commit

* initial commit

* failing test

* table scan projection

* closer

* test passes, with some hacks

* use DataFrame (#2)

* update README

* update dependency

* code cleanup (#3)

* Add support for Filter operator and BinaryOp expressions (#4)

* GitHub action (#5)

* Split code into producer and consumer modules (#6)

* Support more functions and scalar types (#7)

* Use substrait 0.1 and datafusion 8.0 (#8)

* use substrait 0.1

* use datafusion 8.0

* update datafusion to 10.0 and substrait to 0.2 (#11)

* Add basic join support (#12)

* Added fetch support (#23)

Added fetch to consumer

Added limit to producer

Added unit tests for limit

Added roundtrip_fill_none() for testing when None input can be converted to 0

Update src/consumer.rs

Co-authored-by: Andy Grove <[email protected]>

Co-authored-by: Andy Grove <[email protected]>

* Upgrade to DataFusion 13.0.0 (#25)

* Add sort consumer and producer (#24)

Add consumer

Add producer and test

Modified error string

* Add serializer/deserializer (#26)

* Add plan and function extension support (#27)

* Add plan and function extension support

* Removed unwraps

* Implement GROUP BY (#28)

* Add consumer, producer and tests for aggregate relation

Change function extension registration from absolute to relative anchor
(reference)

Remove operator to/from reference

* Fixed function registration bug

* Add test

* Addressed PR comments

* Changed field reference from mask to direct reference (#29)

* Changed field reference from masked reference to direct reference

* Handle unsupported case (struct with child)

* Handle SubqueryAlias (#30)

Fixed aggregate function register bug

* Add support for SELECT DISTINCT (#31)

Add test case

* Implement BETWEEN (#32)

* Add case (#33)

* Implement CASE WHEN

* Add more case to test

* Addressed comments

* feat: support explicit catalog/schema names in ReadRel (#34)

* feat: support explicit catalog/schema names in ReadRel

Signed-off-by: Ruihang Xia <[email protected]>

* fix: use re-exported expr crate

Signed-off-by: Ruihang Xia <[email protected]>

Signed-off-by: Ruihang Xia <[email protected]>

* move files to subfolder

* RAT

* remove rust.yaml

* revert .gitignore changes

* tomlfmt

* tomlfmt

Signed-off-by: Ruihang Xia <[email protected]>
Co-authored-by: Daniël Heres <[email protected]>
Co-authored-by: JanKaul <[email protected]>
Co-authored-by: nseekhao <[email protected]>
Co-authored-by: Ruihang Xia <[email protected]>
alamb added a commit that referenced this pull request Jul 16, 2024
… `interval` (#11466)

* Unparser rule for datatime cast (#10)

* use timestamp as the identifier for date64

* rename

* implement CustomDialectBuilder

* fix

* dialect with interval style (#11)

---------

Co-authored-by: Phillip LeBlanc <[email protected]>

* fmt

* clippy

* doc

* Update datafusion/sql/src/unparser/expr.rs

Co-authored-by: Andrew Lamb <[email protected]>

* update the doc for CustomDialectBuilder

* fix doc test

---------

Co-authored-by: Phillip LeBlanc <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 17, 2024
… `interval` (apache#11466)

* Unparser rule for datatime cast (apache#10)

* use timestamp as the identifier for date64

* rename

* implement CustomDialectBuilder

* fix

* dialect with interval style (apache#11)

---------

Co-authored-by: Phillip LeBlanc <[email protected]>

* fmt

* clippy

* doc

* Update datafusion/sql/src/unparser/expr.rs

Co-authored-by: Andrew Lamb <[email protected]>

* update the doc for CustomDialectBuilder

* fix doc test

---------

Co-authored-by: Phillip LeBlanc <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
… `interval` (apache#11466)

* Unparser rule for datatime cast (apache#10)

* use timestamp as the identifier for date64

* rename

* implement CustomDialectBuilder

* fix

* dialect with interval style (apache#11)

---------

Co-authored-by: Phillip LeBlanc <[email protected]>

* fmt

* clippy

* doc

* Update datafusion/sql/src/unparser/expr.rs

Co-authored-by: Andrew Lamb <[email protected]>

* update the doc for CustomDialectBuilder

* fix doc test

---------

Co-authored-by: Phillip LeBlanc <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants