Refactor interpreted code path + fix nulls in join keys #6

Merged: 18 commits into tom-main from tomnewton/nullable_key_columns on Jan 15, 2024

@Tom-Newton (Owner) commented Jan 14, 2024

  • Added more tests for nulls in the join keys and for duplicate input rows.
    • Fixed both the codegen and interpreted code paths for nulls in join keys.
  • Refactored the interpreted code path.
    • Removed a lot of legacy logic inherited from the normal Spark sort merge join for buffering multiple matches per streamed row. The PIT join allows only one match per left row (this does create some non-determinism if there are duplicate input rows).
  • Updated comments and renamed many variables.
    • Renamed streamed -> left and buffered -> right. "Streamed" and "buffered" only really made sense in the normal sort merge join, where multiple matches are buffered for each streamed row. Additionally, in the PIT join the left and right dataframes are considered fundamentally different.
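
For readers unfamiliar with the semantics described above, here is a minimal plain-Python sketch of a merge-style PIT join. It is not the Spark implementation from this PR. The function name `pit_join`, the `(key, timestamp, payload)` row layout, the backward-looking `right.ts <= left.ts` condition, and keeping unmatched left rows with a `None` match are all assumptions made for illustration. It shows the three points from the description: null keys never match, each left row gets at most one right row, and duplicate right rows make the chosen match depend on input order.

```python
from typing import Any, Optional

# Hypothetical row layout: (join key, timestamp, payload)
Row = tuple[Optional[str], int, Any]


def pit_join(left: list[Row], right: list[Row]) -> list[tuple[Row, Optional[Row]]]:
    """Pair each left row with at most one right row:
    same key, and the latest right timestamp <= the left timestamp."""
    # Null keys never compare equal (SQL-style), so they are excluded from matching.
    left_sorted = sorted((r for r in left if r[0] is not None),
                         key=lambda r: (r[0], r[1]))
    right_sorted = sorted((r for r in right if r[0] is not None),
                          key=lambda r: (r[0], r[1]))

    # Assumption: left rows with a null key are kept, with no match.
    out: list[tuple[Row, Optional[Row]]] = [(l, None) for l in left if l[0] is None]

    i = 0          # single forward-only pointer into the right side
    best = None    # last right row seen at or before the current left row
    for l in left_sorted:
        # Advance the right side while it is at or before (key, ts) of the left row.
        while i < len(right_sorted) and right_sorted[i][:2] <= l[:2]:
            best = right_sorted[i]
            i += 1
        # Only one candidate is ever kept, so no buffering of multiple matches.
        match = best if best is not None and best[0] == l[0] else None
        out.append((l, match))
    return out


left_rows = [("a", 10, "L1"), (None, 10, "L2"), ("b", 5, "L3")]
right_rows = [("a", 3, "R1"), ("a", 9, "R2"), ("a", 11, "R3"),
              (None, 1, "R4"), ("b", 6, "R5")]
for l, m in pit_join(left_rows, right_rows):
    print(l[2], "->", m[2] if m else None)
```

With duplicate right rows such as `("a", 1, "x")` and `("a", 1, "y")`, this sketch returns whichever of the two sorts last, which depends on input order. In a distributed engine that order is not guaranteed, which is one way the non-determinism mentioned above can arise.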

@Tom-Newton changed the title from "Tomnewton/nullable key columns" to "Refactor interpreted code path + fix nulls in join keys" on Jan 15, 2024
@Tom-Newton marked this pull request as ready for review on January 15, 2024 13:47
@Tom-Newton merged commit e5fcee7 into tom-main on Jan 15, 2024
@Tom-Newton deleted the tomnewton/nullable_key_columns branch on August 14, 2024 12:46