-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: add klip-24: key column semantics in queries. #5115
docs: add klip-24: key column semantics in queries. #5115
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @big-andy-coates! I like all of the "main" changes proposed in this KLIP, my comments are mostly on the "secondary" and TBD parts.
Before means less disruption for users, but may mean the feature slips the milestone.
This is my vote, I think the current behavior for any keys would turn off some users - and the confusion for users jumping between models would be really frustrating.
update with Almog's requested changes and suggestions and doc another edge case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left a few comments but in general this all looks good 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Great proposal @big-andy-coates, I think this makes things much cleaner and more explicit. The only question I have has been asked by @rmoff: #5115 (comment) Otherwise LGTM 👍 |
@derekjn, replied to Robin's comment: it's actually a editing mistake in the KLIP. |
I've updated the KLIP with a new edge case around outer joins and joins on non-column-refs. The proposed short to medium term work around is not pretty, but is practical. Would appreciate peoples views. cc @rmoff, @blueedgenick, @agavra, @derekjn, @apurvam |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still LGTM
alias for the system generated `KSQL_COL_0` key column name. Any solution to allow providing an | ||
alias would likely be incompatible with the planned multiple key column support. | ||
|
||
Hence, we propose leaving this edge case unsolved, i.e. users will _not_ be able to provide an alias |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the proposal is to use ROWKEY
column for outer joins, it seems the same pattern could be applied for this case to allow people to rename the PK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hummmm.... you've got a point.
However, I think there's a subtle difference we may want to consider.
With a grouping statement on multiple things, the current implementation combines all the result columns into a single STRING
value. So the Kafka message's key contains all the grouping data, just in a nasty format. No additional column is being synthesised.
Conversely, for outer joins the Kafka message's key contains the result of COALESCE(leftJoinExp, rightJoinExp)
, i.e. unlike the grouping statement, it does NOT contain all the joining data: it loses the data about any side being null
.
Because of this subtle difference we know that the upcoming structured keys work will allow users to alias the multiple grouping expressions in the projection. However, the issue with joins can not be fixed with structured key support alone.
If we go the same route for groupings as we've proposed for joins, then we end up with:
CREATE TABLE OUTPUT AS
SELECT ROWKEY AS K, COUNT() FROM INPUT GROUP BY V0, V1;
-- vs --
CREATE TABLE OUTPUT AS
SELECT V0, V1, COUNT() FROM INPUT GROUP BY V0, V1;
There's certainly arguments for going either way. Personally, I'm happy with what's been proposed for groupings in the KLIP.
cc @derekjn for a product view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we agree that the end-state should be the last query you showed:
CREATE TABLE OUTPUT AS
SELECT V0, V1, COUNT() FROM INPUT GROUP BY V0, V1;
(With a proper structured key <V0,V1>
stored in the message key).
I guess the question is what intermediate state we want to be in. The query from above is a valid query now however, it does not really expose the PK that is stored in the ROWKEY
.
From the KLIP:
we propose that the projection should still accept the individual columns, and recognise them as key columns
Not sure if I can follow. If both columns V0 and V1 are store in the value, how can this be done?
As this KLIP seems to try to expose the actual message key in the schema, it seems consequent to add ROWKEY
for this case as an intermediate step. I understand, that it might look like a step backward for a language POV as the above query that is valid now, would not be valid until we reach the end-state and it becomes valid again...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear the KLIP is proposing that the following will be valid even before structured keys:
CREATE TABLE OUTPUT AS
SELECT V0, V1, COUNT() FROM INPUT GROUP BY V0, V1;
And both V0
and V1
will be stored in the key in a munged together column called KSQL_COL_0
, not in the value.
Yes, we could support:
CREATE TABLE OUTPUT AS
SELECT KSQL_COL_0, COUNT() FROM INPUT GROUP BY V0, V1;
But sures will be left wondering where KSQL_COL_0
came from.
Given than we're adding support for multiple key columns next, product are happy with not supporting aliasing of the resulting key column name.
Part of [klip-24](confluentinc#5115). A Udf that indicates that a key column in a projection should be copied into a value column, for example: ```sql -- Given: CREATE STREAM INPUT (ID INT KEY, V0 INT, V1 INT) WITH (kafka_topic='input', value_format='JSON'); -- When: CREATE STREAM OUTPUT AS SELECT ID, AS_VALUE(ID) AS ID_COPY, V1 FROM INPUT; -- Then: -- resulting schema: ID INT KEY, ID_COPY INT, V1 INT ``` Note, the UDF doesn't actually _do_ anything as yet. It requires the request of klip-24 to enable its true purpose.
* chore: add AS_VALUE Udf Part of [klip-24](#5115). A Udf that indicates that a key column in a projection should be copied into a value column, for example: ```sql -- Given: CREATE STREAM INPUT (ID INT KEY, V0 INT, V1 INT) WITH (kafka_topic='input', value_format='JSON'); -- When: CREATE STREAM OUTPUT AS SELECT ID, AS_VALUE(ID) AS ID_COPY, V1 FROM INPUT; -- Then: -- resulting schema: ID INT KEY, ID_COPY INT, V1 INT ``` Co-authored-by: Andy Coates <[email protected]>
This change implements the change in key semantics in queries outlined in [KLIP-24](confluentinc#5115).
* chore: implement new key semantics in queries This change implements the change in key semantics in queries outlined in [KLIP-24](#5115). Co-authored-by: Andy Coates <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the KLIP!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
design-proposals/klip-24-key-column-semantics-in-queries.md
This klip looks to address some of the shortcomings found during the recent work to remove the restriction that all key columns must be named
ROWKEY
.Please have a read and let me know your thoughts.
Importantly, the 'any key name' work is not yet enabled, (See #5093). So we have a choice... do we fix these semantics before or after we enable this feature?
Before means less disruption for users, but may mean the feature slips the milestone.
After means we'll hit the milestone, but users will be asked to change their existing queries one way, only to be asked to change them back on the next release ... annoying! For example: