-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support multi-column key declarations #6544
Conversation
7b75de3
to
20d2c78
Compare
20d2c78
to
36ded16
Compare
348b36b
to
20d2c78
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome stuff @agavra ! LGTM with the usual questions + nits inline.
ksqldb-engine/src/main/java/io/confluent/ksql/planner/plan/FinalProjectNode.java
Outdated
Show resolved
Hide resolved
if (seenKey.getValue().size() > 1) { | ||
final String keys = GrammaticalJoiner.and().join( | ||
seenKey.getValue().stream().map(Name::text).sorted()); | ||
throw new KsqlException("The projection contains a key column (" + seenKey.getKey() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not directly related to your PR, but can you help me understand why the validation that key columns are not selected more than once is performed in FinalProjectNode
while the validation that key columns are selected is performed in DataSourceNode
, rather than performing both checks in the same place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every node "validates" that the key is selected, they just all delegate to the child node that has that information:
// PlanNode.java:L145
void validateKeyPresent(final SourceName sinkName, final Projection projection) {
getSources().forEach(s -> s.validateKeyPresent(sinkName, projection));
}
DataSourceNode
, JoinNode
and UserRepartitionNode
all override this method because those are the nodes that truly "know" whether the key was selected. Imagine we projected the key column with a different name, the schema in the FinalProjectNode
would consider that the key - but of course we didn't select that! We selected the original key.
To illustrate, in the example below the project node has a schema [col1 INT PRIMARY KEY]
but we didn't select col1
- so it would fail! Of course we could do some magic and "remember" that b.col1
is actually a.id
, but why not just delegate that to the node that already tracks that?
CREATE TABLE a (id INT PRIMARY KEY, col1 INT);
CREATE TABLE b AS SELECT id AS col1;
(Note that it's a little confusing, because there is also VerifiableNode
, which AggregateNode
, FinalProjectNode
and SuppressNode
all override - and it is VerifiableNode#validateKeyPresent
is what is called at the top level, but these just delegate down to the three source nodes eventually).
why the validation that key columns are not selected more than once is performed in
FinalProjectNode
I think this one is clearer why it can't be in DataSourceNode
, it's the other way around that might need some justification.
All that being said, I think this can definitely be made clearer going forward. The logical nodes don't follow a very strict abstraction model unfortunately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation -- super helpful!
ksqldb-engine/src/main/java/io/confluent/ksql/planner/plan/FinalProjectNode.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
public Struct build(final Object keyValue) { | ||
public Struct build(final Object keyValue, final int fieldIndex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To check my understanding: this works today because we only ever end up with a single key after partition by or group by, but this method signature will have to change to support building structs with multiple key columns in a future PR, yes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes - I tried to change this to make it more generic but it causes a shocking amount of code jitter, so I'll do that in a future PR
} | ||
}, | ||
{ | ||
"name": "nested struct key", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this is a nested struct key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
physically, it's a struct within a struct. I'll rename this "struct as column in multi-column key"
} | ||
} | ||
] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also add a negative test for partition by on multiple columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally did not add this because our parser doesn't even except PARTITION BY x,y
but i'll add one in and make sure it doesn't parse
ksqldb-streams/src/test/java/io/confluent/ksql/execution/streams/SourceBuilderTest.java
Outdated
Show resolved
Hide resolved
ksqldb-streams/src/test/java/io/confluent/ksql/execution/streams/SourceBuilderTest.java
Outdated
Show resolved
Hide resolved
3362d09
to
f058668
Compare
partial implementation for #824 - all that's left is lifting the config flag
Description
Well @big-andy-coates does it again. I dug into this and hit some walls the first times around making me think it would be really tough, but after a few scrap-my-work-and-try-again loops I have proved Andy right and myself wrong - it was actually much easier to support multi-column keys than I suspected!
There are still imitations (you cannot issue pull queries against a multi-column key table and you cannot join a multi-column source that cannot be repartition) and the existing limitations still hold (group by still causes an ugly concatenation of the key) but will be fixed in future releases.
Testing done
Mostly fixed unit tests, added
multi-key-cols.json
which tests most of the basic scenarios.TODOs
After this PR we will still need to:
ksql.key.format.enabled
config into multiple that target smaller diffsReviewer checklist