feat: support multi-column key declarations #6544

agavra · 2020-10-29T18:42:11Z

partial implementation for #824 - all that's left is lifting the config flag

Description

Well @big-andy-coates does it again. I dug into this and hit some walls the first times around making me think it would be really tough, but after a few scrap-my-work-and-try-again loops I have proved Andy right and myself wrong - it was actually much easier to support multi-column keys than I suspected!

There are still imitations (you cannot issue pull queries against a multi-column key table and you cannot join a multi-column source that cannot be repartition) and the existing limitations still hold (group by still causes an ugly concatenation of the key) but will be fixed in future releases.

Testing done

Mostly fixed unit tests, added multi-key-cols.json which tests most of the basic scenarios.

TODOs

After this PR we will still need to:

split up the mega ksql.key.format.enabled config into multiple that target smaller diffs
schema inference work
make sure avro/pb work (all tests were done in JSON)
fix semantics
joins, pull query support
partition by multiple column support

Reviewer checklist

Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")

vcrfxia

Awesome stuff @agavra ! LGTM with the usual questions + nits inline.

ksqldb-engine/src/main/java/io/confluent/ksql/planner/plan/FinalProjectNode.java

vcrfxia · 2020-11-05T00:17:07Z

ksqldb-engine/src/main/java/io/confluent/ksql/planner/plan/FinalProjectNode.java

+        if (seenKey.getValue().size() > 1) {
+          final String keys = GrammaticalJoiner.and().join(
+              seenKey.getValue().stream().map(Name::text).sorted());
+          throw new KsqlException("The projection contains a key column (" + seenKey.getKey()


Not directly related to your PR, but can you help me understand why the validation that key columns are not selected more than once is performed in FinalProjectNode while the validation that key columns are selected is performed in DataSourceNode, rather than performing both checks in the same place?

Every node "validates" that the key is selected, they just all delegate to the child node that has that information:

// PlanNode.java:L145 void validateKeyPresent(final SourceName sinkName, final Projection projection) { getSources().forEach(s -> s.validateKeyPresent(sinkName, projection)); }

DataSourceNode, JoinNode and UserRepartitionNode all override this method because those are the nodes that truly "know" whether the key was selected. Imagine we projected the key column with a different name, the schema in the FinalProjectNode would consider that the key - but of course we didn't select that! We selected the original key.

To illustrate, in the example below the project node has a schema [col1 INT PRIMARY KEY] but we didn't select col1 - so it would fail! Of course we could do some magic and "remember" that b.col1 is actually a.id, but why not just delegate that to the node that already tracks that?

CREATE TABLE a (id INT PRIMARY KEY, col1 INT); CREATE TABLE b AS SELECT id AS col1;

(Note that it's a little confusing, because there is also VerifiableNode, which AggregateNode, FinalProjectNode and SuppressNode all override - and it is VerifiableNode#validateKeyPresent is what is called at the top level, but these just delegate down to the three source nodes eventually).

why the validation that key columns are not selected more than once is performed in FinalProjectNode

I think this one is clearer why it can't be in DataSourceNode, it's the other way around that might need some justification.

All that being said, I think this can definitely be made clearer going forward. The logical nodes don't follow a very strict abstraction model unfortunately.

Thanks for the explanation -- super helpful!

ksqldb-engine/src/main/java/io/confluent/ksql/planner/plan/FinalProjectNode.java

vcrfxia · 2020-11-05T00:17:55Z

ksqldb-execution/src/main/java/io/confluent/ksql/execution/util/StructKeyUtil.java

    }

-    public Struct build(final Object keyValue) {
+    public Struct build(final Object keyValue, final int fieldIndex) {


To check my understanding: this works today because we only ever end up with a single key after partition by or group by, but this method signature will have to change to support building structs with multiple key columns in a future PR, yes?

yes - I tried to change this to make it more generic but it causes a shocking amount of code jitter, so I'll do that in a future PR

vcrfxia · 2020-11-05T00:18:23Z

ksqldb-functional-tests/src/test/resources/query-validation-tests/multi-col-keys.json

+      }
+    },
+    {
+      "name": "nested struct key",


How is this is a nested struct key?

physically, it's a struct within a struct. I'll rename this "struct as column in multi-column key"

vcrfxia · 2020-11-05T00:18:26Z

ksqldb-functional-tests/src/test/resources/query-validation-tests/multi-col-keys.json

+      }
+    }
+  ]
+}


Can we also add a negative test for partition by on multiple columns?

I originally did not add this because our parser doesn't even except PARTITION BY x,y but i'll add one in and make sure it doesn't parse

ksqldb-streams/src/test/java/io/confluent/ksql/execution/streams/SourceBuilderTest.java

agavra force-pushed the multi-column-keys branch 3 times, most recently from 7b75de3 to 20d2c78 Compare October 29, 2020 21:44

agavra changed the title ~~feat: support multi-column key declarations [DRAFT - testing still in progress]~~ feat: support multi-column key declarations Oct 29, 2020

agavra force-pushed the multi-column-keys branch from 20d2c78 to 36ded16 Compare October 29, 2020 22:04

agavra marked this pull request as ready for review October 29, 2020 22:07

agavra requested a review from a team as a code owner October 29, 2020 22:07

agavra requested a review from vcrfxia October 29, 2020 22:07

agavra force-pushed the multi-column-keys branch 2 times, most recently from 348b36b to 20d2c78 Compare October 30, 2020 17:40

vcrfxia approved these changes Nov 5, 2020

View reviewed changes

agavra added 3 commits November 5, 2020 15:39

feat: support multi-column key declarations

b801790

chore: victorias comments

3b251dd

chore: rebase

f058668

agavra force-pushed the multi-column-keys branch from 3362d09 to f058668 Compare November 5, 2020 23:43

agavra merged commit 2e1023e into confluentinc:master Nov 6, 2020

agavra deleted the multi-column-keys branch November 6, 2020 01:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: support multi-column key declarations #6544

feat: support multi-column key declarations #6544

agavra commented Oct 29, 2020 •

edited

Loading

vcrfxia left a comment

vcrfxia Nov 5, 2020

agavra Nov 5, 2020

vcrfxia Nov 5, 2020

vcrfxia Nov 5, 2020

agavra Nov 5, 2020

vcrfxia Nov 5, 2020

agavra Nov 5, 2020

vcrfxia Nov 5, 2020

agavra Nov 5, 2020

feat: support multi-column key declarations #6544

feat: support multi-column key declarations #6544

Conversation

agavra commented Oct 29, 2020 • edited Loading

Description

Testing done

TODOs

Reviewer checklist

vcrfxia left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

agavra commented Oct 29, 2020 •

edited

Loading