feat: expression support in JOINs #4278

agavra · 2020-01-11T00:35:16Z

Description

This PR introduces the ability to do arbitrary expressions in join conditions as long as on each side of the equality exactly one join source is used, and both join sources are represented. As an added bonus, this feature actually removes more code than it adds!

To implement this, I simply removed all repartition logic from the JoinNode and instead added a repartition step (reusing the recently introduced RepartitionNode) to each of the source nodes. This path is always hit, if no repartition is required then the repartition node simply returns the same source as was passed into it. Now, the JoinNode always joins on the keys of its input.

Review Guide

To review, I recommend first looking at the Analyzer which has been simplified just to check the condition outlined above. Then, look at the LogicalPlanner and finally at the JoinNode.

Testing done

Updated unit tests
Updated all the QTT tests that failed when expressions were present in the join criteria

Reviewer checklist

Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")

ksql-engine/src/main/java/io/confluent/ksql/planner/LogicalPlanner.java

rodesai · 2020-01-14T00:30:55Z

ksql-engine/src/main/java/io/confluent/ksql/planner/LogicalPlanner.java

+        // a no-op if a repartition is not required. if the source is a table, and
+        // a repartition is needed, then an exception will be thrown
+        buildRepartitionNode(leftSourceNode, joinInfo.get().getLeftJoinExpression()),
+        buildRepartitionNode(rightSourceNode, joinInfo.get().getRightJoinExpression()),


I'm not sure adding repartition nodes in the logical plan is the right way to go here. IMO the transformation performed by the nodes in the logical plan should correspond closely to the semantics of the corresponding transformation in the query, e.g.

RepartitionNode implements PARTITION BY
ProjectNode implements SELECT <ITEMS>
FilterNode implements WHERE <expr>

I see two problems in this particular case:

The semantics of the repartition for the join might be different from the semantics of the repartition needed to perform a PARTITION BY. A PARTITION BY transforms the schema in a very specific way that may not make sense here. It's kind of hard to give a concrete example since we're still deciding on PARTITION BY semantics. But one example might be adding default column names for the new key columns (depending on the semantics we choose for it). If PARTITION BY semantics support adding columns in the PARTITION BY, those columns should get some generated name. But it wouldn't make sense to add those columns for the repartition done here.

The repartition here is really an implementation detail of the join. One place that comes to mind where this might come up down the line: we've discussed changing the join semantics to preserve the original key of the left side. The current behavior is really exposing implementation details in our query semantics. If we did that it would be awkward to have the original rekey in these repartition nodes and the post-join rekey in JoinNode

IMO the transformation performed by the nodes in the logical plan should correspond closely to the semantics of the corresponding transformation in the query

My knowledge on SQL theory and relational algebra is pretty minimal, but I feel that the fact that the nodes mirror the SQL statements is coincidental (a reflection of how new our codebase is) not characteristic. Most databases have a step that optimizes the logical plan independent of the physical plan so the output of the logical planner may only loosely resemble the original SQL query. Otherwise, why have the extra step to translate between the query nodes and the logical plan nodes?

Instead, I think the logical nodes should define steps that describe the logical transformation and they should constitute the smallest reusable unit of logical work - otherwise reusing (composing) them and optimizing complex queries becomes unmanageable. The LogicalPlanner's job is to turn a SQL query into a logical plan.

The semantics of the repartition for the join might be different from the semantics of the repartition needed to perform a PARTITION BY ... If PARTITION BY semantics support adding columns in the PARTITION BY, those columns should get some generated name. But it wouldn't make sense to add those columns for the repartition done here.

This falls pretty well into my view of logical plan. In my view, if PARTITION BY supports adding columns then a single PARTITION BY should map to two logical nodes - a projection node (or a rename) and a partition node.

The repartition here is really an implementation detail of the join.

I think things get a little tricky with PARTITION BY because it's the only step that spans the logical and the physical world. Really, we don't have a "RekeyNode" (which is perhaps a better name for the RepartitionNode) - but we have defined our Joins as only working on streams with the same key (and this a logical, not a physical, requirement).

If we did that it would be awkward to have the original rekey in these repartition nodes and the post-join rekey in JoinNode

It wouldn't be in the JoinNode - it would be yet another RepartitionNode (which, again, should probably be named RekeyNode).

ksql-engine/src/main/java/io/confluent/ksql/structured/SchemaKStream.java

ksql-engine/src/main/java/io/confluent/ksql/analyzer/ExpressionAnalyzer.java

rodesai · 2020-01-14T00:39:22Z

ksql-engine/src/main/java/io/confluent/ksql/analyzer/Analyzer.java

+      try {
+        return columnRefs.stream()
+            .map(ColumnRef::source)
+            .filter(Optional::isPresent)


If the column is not qualified, don't you need to check that it belongs to exactly one source, and include that source in the stream? If we are assuming all the columns are qualified then we should throw if the source is empty.

it is checked earlier that it belongs to only one source (in ExpressionAnalyzer#throwOnUnknownOrAmbiguousColumn) - I'll add a comment explaining this

EDIT: actually, there's a possibility that they're all unambiguous but also not qualified. I'll make the ExpressionAnalyzer add the qualifier.

ksql-engine/src/main/java/io/confluent/ksql/analyzer/Analyzer.java

rodesai

LGTM

agavra requested a review from big-andy-coates January 11, 2020 00:35

agavra requested a review from a team as a code owner January 11, 2020 00:35

feat: expression support in JOINs

68496bf

agavra force-pushed the join_exp branch from 8034781 to 68496bf Compare January 13, 2020 18:58

rodesai self-assigned this Jan 13, 2020

rodesai reviewed Jan 14, 2020

View reviewed changes

fix: rohans requested changes

164130c

agavra requested a review from rodesai January 14, 2020 22:46

rodesai approved these changes Jan 15, 2020

View reviewed changes

agavra merged commit 2d0bfe8 into confluentinc:master Jan 15, 2020

agavra deleted the join_exp branch January 15, 2020 00:18

fish-face mentioned this pull request Feb 28, 2020

Support implicit repartitioning of table sources during joins #4666

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: expression support in JOINs #4278

feat: expression support in JOINs #4278

agavra commented Jan 11, 2020

rodesai Jan 14, 2020 •

edited

Loading

agavra Jan 14, 2020

rodesai Jan 14, 2020

agavra Jan 14, 2020 •

edited

Loading

rodesai left a comment

feat: expression support in JOINs #4278

feat: expression support in JOINs #4278

Conversation

agavra commented Jan 11, 2020

Description

Review Guide

Testing done

Reviewer checklist

rodesai Jan 14, 2020 • edited Loading

Choose a reason for hiding this comment

agavra Jan 14, 2020

Choose a reason for hiding this comment

rodesai Jan 14, 2020

Choose a reason for hiding this comment

agavra Jan 14, 2020 • edited Loading

Choose a reason for hiding this comment

rodesai left a comment

Choose a reason for hiding this comment

rodesai Jan 14, 2020 •

edited

Loading

agavra Jan 14, 2020 •

edited

Loading