Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type #32495

Closed
wants to merge 1 commit into from

Conversation

c21
Copy link
Contributor

@c21 c21 commented May 10, 2021

What changes were proposed in this pull request?

This is a pre-requisite of #32476, in discussion of #32476 (comment) . This is to refactor sort merge join code-gen to depend on streamed/buffered terminology, which makes the code-gen agnostic to different join types and can be extended to support other join types than inner join.

Why are the changes needed?

Pre-requisite of #32476.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit test in InnerJoinSuite.scala for inner join code-gen.

@c21
Copy link
Contributor Author

c21 commented May 10, 2021

cc @cloud-fan and @maropu , thanks.

@SparkQA
Copy link

SparkQA commented May 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42865/

@SparkQA
Copy link

SparkQA commented May 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42865/

@c21
Copy link
Contributor Author

c21 commented May 11, 2021

Thank you @maropu for review. There was a maven compilation error in github action. It looks transient to me. Waiting for jenkins test result now.

@maropu
Copy link
Member

maropu commented May 11, 2021

Ah, Jenkins does not check the java11 compilation? I think It's be better to re-invoke the GA tests just in case for checking if the compilation is okay. (though I also think that's transient error as you said).

@c21
Copy link
Contributor Author

c21 commented May 11, 2021

Thanks @maropu, restarted GA test here - https://github.com/c21/spark/actions/runs/829684422 .

@maropu
Copy link
Member

maropu commented May 11, 2021

okay, I've checked that it passed. I will merge this.

@maropu maropu closed this in c4ca232 May 11, 2021
@maropu
Copy link
Member

maropu commented May 11, 2021

All the GA tests passed. Merged to master. Thank you, @c21

@c21
Copy link
Contributor Author

c21 commented May 11, 2021

Thank you @maropu for monitoring and review!

@c21 c21 deleted the smj-refactor branch May 11, 2021 02:23
@SparkQA
Copy link

SparkQA commented May 11, 2021

Test build #138343 has finished for PR 32495 at commit 1912436.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait ShuffledJoin extends JoinCodegenSupport

| }
| } while ($leftRow != null);
| } while ($streamedRow != null);
| }
| return false; // unreachable
|}
""".stripMargin, inlineToOuterClass = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have multiple SMJ in one whole-stage, will we have multiple findNextJoinRows methods in the outer class and fail the compilation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is applying to inner join before this PR as well. There won't be an issue and we will never have multiple findNextJoinRows in one class. The reason is with current design, sort merge join will always do code-gen for its children separately, so there won't be two SortMergeJoinExecs code-gen-ed in the same class.

Verified with example query:

val df1 = spark.range(10).select($"id".as("k1"))
val df2 = spark.range(4).select($"id".as("k2"))
val df3 = spark.range(6).select($"id".as("k3"))
df3.join(df2.hint("SHUFFLE_MERGE"), $"k3" === $"k2", "left_outer")
  .join(df1.hint("SHUFFLE_MERGE"), $"k3" === $"k1", "right_outer")
  .explain("codegen")

Query plan:

*(8) SortMergeJoin [k3#10L], [k1#2L], RightOuter
:- *(5) SortMergeJoin [k3#10L], [k2#6L], LeftOuter
:  :- *(2) Sort [k3#10L ASC NULLS FIRST], false, 0
:  :  +- Exchange hashpartitioning(k3#10L, 5), ENSURE_REQUIREMENTS, [id=#43]
:  :     +- *(1) Project [id#8L AS k3#10L]
:  :        +- *(1) Range (0, 6, step=1, splits=2)
:  +- *(4) Sort [k2#6L ASC NULLS FIRST], false, 0
:     +- Exchange hashpartitioning(k2#6L, 5), ENSURE_REQUIREMENTS, [id=#49]
:        +- *(3) Project [id#4L AS k2#6L]
:           +- *(3) Range (0, 4, step=1, splits=2)
+- *(7) Sort [k1#2L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(k1#2L, 5), ENSURE_REQUIREMENTS, [id=#58]
      +- *(6) Project [id#0L AS k1#2L]
         +- *(6) Range (0, 10, step=1, splits=2)

All generated code is in https://gist.github.com/c21/873775bcd08583105b289e67221f6e17.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth to add a comment for it, in case we changed this in the future (codegen one side with SMJ together)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or we just make it future proof and create a fresh function name here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan - thanks for calling it out. I would like to make it future proof by giving it a fresh name. Let me do it in a followup PR. Actually I was wondering the same question as you when implementing and spent some time to figuring it out.

dongjoon-hyun pushed a commit that referenced this pull request May 16, 2021
…ead of hardcoding it

### What changes were proposed in this pull request?

This is a followup from discussion in #32495 (comment) . The hardcoded function name `findNextJoinRows` is not a real problem now as we always do code generation for SMJ's children separately. But this change is to make it future proof in case this assumption changed in the future.

### Why are the changes needed?

Fix the potential reliability issue.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #32548 from c21/smj-followup.

Authored-by: Cheng Su <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants