[SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type #32495

c21 · 2021-05-10T22:42:30Z

What changes were proposed in this pull request?

This is a pre-requisite of #32476, in discussion of #32476 (comment) . This is to refactor sort merge join code-gen to depend on streamed/buffered terminology, which makes the code-gen agnostic to different join types and can be extended to support other join types than inner join.

Why are the changes needed?

Pre-requisite of #32476.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit test in InnerJoinSuite.scala for inner join code-gen.

c21 · 2021-05-10T22:44:04Z

cc @cloud-fan and @maropu , thanks.

SparkQA · 2021-05-10T23:59:19Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42865/

SparkQA · 2021-05-10T23:59:20Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42865/

c21 · 2021-05-11T00:52:53Z

Thank you @maropu for review. There was a maven compilation error in github action. It looks transient to me. Waiting for jenkins test result now.

maropu · 2021-05-11T01:08:08Z

Ah, Jenkins does not check the java11 compilation? I think It's be better to re-invoke the GA tests just in case for checking if the compilation is okay. (though I also think that's transient error as you said).

c21 · 2021-05-11T01:41:49Z

Thanks @maropu, restarted GA test here - https://github.com/c21/spark/actions/runs/829684422 .

maropu · 2021-05-11T02:20:35Z

okay, I've checked that it passed. I will merge this.

maropu · 2021-05-11T02:22:44Z

All the GA tests passed. Merged to master. Thank you, @c21

c21 · 2021-05-11T02:23:06Z

Thank you @maropu for monitoring and review!

SparkQA · 2021-05-11T03:44:32Z

Test build #138343 has finished for PR 32495 at commit 1912436.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait ShuffledJoin extends JoinCodegenSupport

cloud-fan · 2021-05-13T13:51:56Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala

         |      }
-         |    } while ($leftRow != null);
+         |    } while ($streamedRow != null);
         |  }
         |  return false; // unreachable
         |}
       """.stripMargin, inlineToOuterClass = true)


If we have multiple SMJ in one whole-stage, will we have multiple findNextJoinRows methods in the outer class and fail the compilation?

This is applying to inner join before this PR as well. There won't be an issue and we will never have multiple findNextJoinRows in one class. The reason is with current design, sort merge join will always do code-gen for its children separately, so there won't be two SortMergeJoinExecs code-gen-ed in the same class.

Verified with example query:

val df1 = spark.range(10).select($"id".as("k1")) val df2 = spark.range(4).select($"id".as("k2")) val df3 = spark.range(6).select($"id".as("k3")) df3.join(df2.hint("SHUFFLE_MERGE"), $"k3" === $"k2", "left_outer") .join(df1.hint("SHUFFLE_MERGE"), $"k3" === $"k1", "right_outer") .explain("codegen")

Query plan:

*(8) SortMergeJoin [k3#10L], [k1#2L], RightOuter :- *(5) SortMergeJoin [k3#10L], [k2#6L], LeftOuter : :- *(2) Sort [k3#10L ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(k3#10L, 5), ENSURE_REQUIREMENTS, [id=#43] : : +- *(1) Project [id#8L AS k3#10L] : : +- *(1) Range (0, 6, step=1, splits=2) : +- *(4) Sort [k2#6L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(k2#6L, 5), ENSURE_REQUIREMENTS, [id=#49] : +- *(3) Project [id#4L AS k2#6L] : +- *(3) Range (0, 4, step=1, splits=2) +- *(7) Sort [k1#2L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(k1#2L, 5), ENSURE_REQUIREMENTS, [id=#58] +- *(6) Project [id#0L AS k1#2L] +- *(6) Range (0, 10, step=1, splits=2)

All generated code is in https://gist.github.com/c21/873775bcd08583105b289e67221f6e17.

got it, thanks!

worth to add a comment for it, in case we changed this in the future (codegen one side with SMJ together)

or we just make it future proof and create a fresh function name here.

@cloud-fan - thanks for calling it out. I would like to make it future proof by giving it a fresh name. Let me do it in a followup PR. Actually I was wondering the same question as you when implementing and spent some time to figuring it out.

…ead of hardcoding it ### What changes were proposed in this pull request? This is a followup from discussion in #32495 (comment) . The hardcoded function name `findNextJoinRows` is not a real problem now as we always do code generation for SMJ's children separately. But this change is to make it future proof in case this assumption changed in the future. ### Why are the changes needed? Fix the potential reliability issue. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #32548 from c21/smj-followup. Authored-by: Cheng Su <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

Refactor sort merge join code-gen be agnostic to join type

1912436

github-actions bot added the SQL label May 10, 2021

c21 mentioned this pull request May 10, 2021

[SPARK-35349][SQL] Add code-gen for left/right outer sort merge join #32476

Closed

maropu approved these changes May 11, 2021

View reviewed changes

maropu closed this in c4ca232 May 11, 2021

c21 deleted the smj-refactor branch May 11, 2021 02:23

cloud-fan reviewed May 13, 2021

View reviewed changes

c21 mentioned this pull request May 14, 2021

[SPARK-35363][SQL][FOLLOWUP] Use fresh name for findNextJoinRows instead of hardcoding it #32548

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type #32495

[SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type #32495

c21 commented May 10, 2021

c21 commented May 10, 2021

SparkQA commented May 10, 2021

SparkQA commented May 10, 2021

c21 commented May 11, 2021

maropu commented May 11, 2021

c21 commented May 11, 2021

maropu commented May 11, 2021

maropu commented May 11, 2021

c21 commented May 11, 2021

SparkQA commented May 11, 2021

cloud-fan May 13, 2021

c21 May 13, 2021

cloud-fan May 14, 2021

cloud-fan May 14, 2021

cloud-fan May 14, 2021

c21 May 14, 2021

[SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type #32495

[SPARK-35363][SQL] Refactor sort merge join code-gen be agnostic to join type #32495

Conversation

c21 commented May 10, 2021

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

c21 commented May 10, 2021

SparkQA commented May 10, 2021

SparkQA commented May 10, 2021

c21 commented May 11, 2021

maropu commented May 11, 2021

c21 commented May 11, 2021

maropu commented May 11, 2021

maropu commented May 11, 2021

c21 commented May 11, 2021

SparkQA commented May 11, 2021

cloud-fan May 13, 2021

Choose a reason for hiding this comment

c21 May 13, 2021

Choose a reason for hiding this comment

cloud-fan May 14, 2021

Choose a reason for hiding this comment

cloud-fan May 14, 2021

Choose a reason for hiding this comment

cloud-fan May 14, 2021

Choose a reason for hiding this comment

c21 May 14, 2021

Choose a reason for hiding this comment