Refactor interpreted code path + fix nulls in join keys #6

Tom-Newton · 2024-01-14T23:27:49Z

Add more tests around nulls in the join keys and duplicate input rows.
- Fix both the codegen and interpreted code paths for nulls in join keys.
Refactor the interpreted code path.
- Remove a lot of legacy code inherited from the normal Spark sort merge join, which buffers multiple matches for each streamed row. The PIT join allows only one match per left row (this does create some non-determinism if there are duplicate input rows).
Update comments and rename lots of variables.
- Rename streamed -> left and buffered -> right. "Streamed" and "buffered" only really made sense in the normal sort merge join, because it buffers multiple matches. Additionally, in the PIT join the left and right dataframes are considered fundamentally different.
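
To illustrate the semantics described above (one match per left row, null join keys never matching, an optional tolerance), here is a minimal sketch in plain Python. It is not the code in this PR: the function name `pit_join`, the `(key, timestamp, value)` row layout, and the `tolerance` and `how` parameters are assumptions made for illustration, and it uses a per-key index with binary search rather than a sort merge scan.

```python
from bisect import bisect_right
from collections import defaultdict


def pit_join(left, right, tolerance=None, how="inner"):
    """Hypothetical point-in-time join over (key, timestamp, value) rows.

    Each left row gets at most one match: the latest right row with the
    same key and a timestamp <= the left timestamp.
    """
    # Index right rows by key. Null keys are skipped because a null key
    # is never considered equal to anything, including another null.
    by_key = defaultdict(list)
    for key, ts, value in right:
        if key is not None:
            by_key[key].append((ts, value))
    for rows in by_key.values():
        rows.sort(key=lambda r: r[0])  # stable sort on timestamp only

    out = []
    for key, ts, value in left:
        match = None
        rows = by_key.get(key) if key is not None else None
        if rows:
            # Position of the latest right timestamp that is <= ts.
            i = bisect_right([r[0] for r in rows], ts) - 1
            if i >= 0 and (tolerance is None or ts - rows[i][0] <= tolerance):
                match = rows[i]
        if match is not None:
            out.append((key, ts, value, match[0], match[1]))
        elif how == "left":
            # Left join: keep the unmatched left row, padded with nulls.
            out.append((key, ts, value, None, None))
    return out


left = [("a", 10, "l1"), (None, 10, "l2"), ("a", 3, "l3")]
right = [("a", 5, "r1"), ("a", 8, "r2"), (None, 9, "r3"), ("a", 20, "r4")]
inner_rows = pit_join(left, right)
left_rows = pit_join(left, right, how="left")
```

In this sketch, when several right rows share the same key and timestamp, the one that appears last in the input wins. In a distributed engine the input order is generally not guaranteed, which is one way to picture the non-determinism with duplicate rows mentioned above.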


Tom-Newton added 14 commits January 14, 2024 12:01

Start tests

479a9b3

Valid tests

ee804af

Fix codegen version

31a6a68

Tests with duplicate rows

46a73b1

Delete unnecessary code. It was a legacy of the normal join's multi matching behaviour

1e3a078

Working inner joins with more code removed and tolerance applied inside main search loop

9488000

Working inner join

2aff220

Tidy

8d7b74d

Very minor clean ups

c4ccfc2

Remove some unnecessary branching

031610b

Fix schema assertion for left_join_duplicate_join_keys

34d0b8b

Remove unneeded bound condition

45ae25a

Fix schema assertions for nulls in join keys

c6845d4

Update comments and rename variables

4c5bf34

Tom-Newton changed the title ~~Tomnewton/nullable key columns~~ Refactor interpreted code path + fix nulls in join keys Jan 15, 2024

Tom-Newton added 4 commits January 15, 2024 13:36

Auto-format

d672381

Remove slightly misleading comment

88ce2cb

Remove completed TODO comment

822b9c8

More comment adjustments

13106db

Tom-Newton marked this pull request as ready for review January 15, 2024 13:47

Tom-Newton merged commit e5fcee7 into tom-main Jan 15, 2024

Tom-Newton deleted the tomnewton/nullable_key_columns branch August 14, 2024 12:46
