docs: add klip-24: key column semantics in queries. #5115

big-andy-coates · 2020-04-20T13:35:19Z

This klip looks to address some of the shortcomings found during the recent work to remove the restriction that all key columns must be named ROWKEY.

Please have a read and let me know your thoughts.

Importantly, the 'any key name' work is not yet enabled (see #5093). So we have a choice: do we fix these semantics before or after we enable this feature?

Before means less disruption for users, but may mean the feature slips the milestone.

After means we'll hit the milestone, but users will be asked to change their existing queries one way, only to be asked to change them back on the next release ... annoying! For example:

-- existing valid GROUP BY persistent query:
CREATE TABLE OUTPUT AS
   SELECT V0, COUNT(*) AS COUNT
     FROM INPUT GROUP BY V0;

-- with 'any key name' merged the above fails with a duplicate column error on V0.
-- the user needs to change the query to:
CREATE TABLE OUTPUT AS
   SELECT COUNT(*) AS COUNT
     FROM INPUT GROUP BY V0;

-- with this klip the first statement is valid, but the second is not, hence we'd be asking users to change their query back again.

big-andy-coates · 2020-04-20T13:36:26Z

cc: @MichaelDrogalis @derekjn @agavra @apurvam @blueedgenick @rmoff

agavra

Thanks @big-andy-coates! I like all of the "main" changes proposed in this KLIP, my comments are mostly on the "secondary" and TBD parts.

> Before means less disruption for users, but may mean the feature slips the milestone.

This is my vote, I think the current behavior for any keys would turn off some users - and the confusion for users jumping between models would be really frustrating.

design-proposals/klip-24-key-column-semantics-in-queries.md

big-andy-coates · 2020-04-21T13:42:30Z

@mjsax & @agavra: I've updated this in line with discussions and also discovered another edge case: grouping by multiple expressions. Can you please take a look at this section and provide any feedback? Thanks!

rmoff

left a few comments but in general this all looks good 👍

agavra

LGTM!

derekjn · 2020-04-21T20:52:03Z

Great proposal @big-andy-coates, I think this makes things much cleaner and more explicit. The only question I have has been asked by @rmoff: #5115 (comment)

Otherwise LGTM 👍

big-andy-coates · 2020-04-22T12:23:00Z

> Great proposal @big-andy-coates, I think this makes things much cleaner and more explicit. The only question I have has been asked by @rmoff: #5115 (comment)
>
> Otherwise LGTM 👍

@derekjn, replied to Robin's comment: it's actually an editing mistake in the KLIP.

big-andy-coates · 2020-04-23T10:15:30Z

I've updated the KLIP with a new edge case around outer joins and joins on non-column-refs. The proposed short-to-medium-term workaround is not pretty, but is practical.

Would appreciate people's views. cc @rmoff, @blueedgenick, @agavra, @derekjn, @apurvam

agavra

Still LGTM

mjsax · 2020-04-24T16:45:16Z

design-proposals/klip-24-key-column-semantics-in-queries.md

+alias for the system generated `KSQL_COL_0` key column name. Any solution to allow providing an
+alias would likely be incompatible with the planned multiple key column support. 
+
+Hence, we propose leaving this edge case unsolved, i.e. users will _not_ be able to provide an alias


As the proposal is to use the ROWKEY column for outer joins, it seems the same pattern could be applied for this case to allow people to rename the PK?

big-andy-coates · 2020-04-27

Hummmm.... you've got a point.

However, I think there's a subtle difference we may want to consider.

With a grouping statement on multiple things, the current implementation combines all the result columns into a single STRING value. So the Kafka message's key contains all the grouping data, just in a nasty format. No additional column is being synthesised.

Conversely, for outer joins the Kafka message's key contains the result of COALESCE(leftJoinExp, rightJoinExp), i.e. unlike the grouping statement, it does NOT contain all the joining data: it loses the data about any side being null.

Because of this subtle difference we know that the upcoming structured keys work will allow users to alias the multiple grouping expressions in the projection. However, the issue with joins cannot be fixed with structured key support alone.
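
To make the outer-join point concrete, here is a minimal sketch. The stream names `L` and `R`, their columns, the topic names and the join window are hypothetical, and the key values in the comments simply apply the COALESCE behaviour described above; they are not captured ksqlDB output.

```sql
-- Hypothetical sources, both keyed by ID (all names are illustrative only).
CREATE STREAM L (ID INT KEY, V0 INT) WITH (kafka_topic='left', value_format='JSON');
CREATE STREAM R (ID INT KEY, V1 INT) WITH (kafka_topic='right', value_format='JSON');

-- Full outer join: either side may be missing for a given output row.
CREATE STREAM OUTPUT AS
  SELECT *
    FROM L FULL OUTER JOIN R WITHIN 10 SECONDS ON L.ID = R.ID;

-- Message key = COALESCE(L.ID, R.ID), so for example:
--   L.ID = 1,    R.ID = 1     -> key 1
--   L.ID = 1,    R.ID = NULL  -> key 1
--   L.ID = NULL, R.ID = 1     -> key 1
-- All three rows share the same key, so the key alone
-- cannot say which side of the join was missing.
```

By contrast, the munged grouping key still carries every grouping value, which is why structured keys alone can fix the grouping case but not the join case.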

If we go the same route for groupings as we've proposed for joins, then we end up with:

CREATE TABLE OUTPUT AS SELECT ROWKEY AS K, COUNT(*) FROM INPUT GROUP BY V0, V1;
-- vs --
CREATE TABLE OUTPUT AS SELECT V0, V1, COUNT(*) FROM INPUT GROUP BY V0, V1;

There are certainly arguments for going either way. Personally, I'm happy with what's been proposed for groupings in the KLIP.

cc @derekjn for a product view.

mjsax · 2020-04-27

I guess we agree that the end-state should be the last query you showed:

CREATE TABLE OUTPUT AS SELECT V0, V1, COUNT(*) FROM INPUT GROUP BY V0, V1;

(With a proper structured key <V0,V1> stored in the message key).

I guess the question is what intermediate state we want to be in. The query above is valid now; however, it does not really expose the PK that is stored in the ROWKEY.

From the KLIP:

> we propose that the projection should still accept the individual columns, and recognise them as key columns

Not sure if I can follow. If both columns V0 and V1 are stored in the value, how can this be done?

As this KLIP seems to try to expose the actual message key in the schema, it seems consistent to add ROWKEY for this case as an intermediate step. I understand that it might look like a step backward from a language POV, as the above query, which is valid now, would not be valid until we reach the end-state and it becomes valid again...

big-andy-coates · 2020-05-06

To be clear, the KLIP is proposing that the following will be valid even before structured keys:

CREATE TABLE OUTPUT AS SELECT V0, V1, COUNT(*) FROM INPUT GROUP BY V0, V1;

And both V0 and V1 will be stored in the key, in a munged-together column called KSQL_COL_0, not in the value.

Yes, we could support:

CREATE TABLE OUTPUT AS SELECT KSQL_COL_0, COUNT(*) FROM INPUT GROUP BY V0, V1;

But users will be left wondering where KSQL_COL_0 came from.

Given that we're adding support for multiple key columns next, product are happy with not supporting aliasing of the resulting key column name.

chore: add AS_VALUE Udf

Part of [klip-24](#5115). A Udf that indicates that a key column in a projection should be copied into a value column, for example:

```sql
-- Given:
CREATE STREAM INPUT (ID INT KEY, V0 INT, V1 INT) WITH (kafka_topic='input', value_format='JSON');

-- When:
CREATE STREAM OUTPUT AS SELECT ID, AS_VALUE(ID) AS ID_COPY, V1 FROM INPUT;

-- Then:
-- resulting schema: ID INT KEY, ID_COPY INT, V1 INT
```

Note, the UDF doesn't actually _do_ anything as yet. It requires the rest of klip-24 to enable its true purpose.

Co-authored-by: Andy Coates

chore: implement new key semantics in queries

This change implements the change in key semantics in queries outlined in [KLIP-24](#5115).

Co-authored-by: Andy Coates

mjsax

Thanks for the KLIP!

docs: add klip-24

80f66e2

big-andy-coates requested a review from a team as a code owner April 20, 2020 13:35

agavra reviewed Apr 20, 2020

agavra requested a review from a team April 20, 2020 16:48

chore: removal of the optional about removing the right key column

8a6151e

This was referenced Apr 20, 2020

Any key name #5093

Merged

Support MVs without key column. #5125

Open

chore: update

6c76978

update with Almog's requested changes and suggestions and doc another edge case

add group by aliasing as possible solution

48273b7

rmoff approved these changes Apr 21, 2020

agavra approved these changes Apr 21, 2020

big-andy-coates added 2 commits April 22, 2020 13:00

remove incorrect use of aliasing on multi-column grouping

b02b654

chore: it's so unfair.

da2ec54

chore: add outer join edge case

1bbd95e

agavra approved these changes Apr 23, 2020

big-andy-coates added 2 commits April 24, 2020 16:37

Update klip-24-key-column-semantics-in-queries.md

1599a97

Update klip-24-key-column-semantics-in-queries.md

772ccf8

mjsax reviewed Apr 24, 2020

derekjn approved these changes Apr 24, 2020

big-andy-coates mentioned this pull request Apr 26, 2020

Breaking Change: Don't include ROWKEY in join schema when selecting all columns #3731

Closed

big-andy-coates mentioned this pull request Apr 26, 2020

klip-24 key semantics #5191

Closed

chore: updated with mjsax's requested changes

a337830

big-andy-coates commented Apr 27, 2020

big-andy-coates mentioned this pull request Apr 27, 2020

chore: add AS_VALUE Udf #5194

Merged

chore: update to use JOINKEY udf for synthetic join columns

3bde52b

mjsax reviewed Apr 27, 2020

big-andy-coates added a commit to big-andy-coates/ksql that referenced this pull request May 4, 2020

chore: implement new key semantics in queries

fd29558

This change implements the change in key semantics in queries outlined in [KLIP-24](confluentinc#5115).

big-andy-coates and others added 2 commits May 6, 2020 19:04

chore: updated with latest comments

519a7b7

Merge branch 'master' into klip-24-key-column-query-semantics

2698f8c

big-andy-coates mentioned this pull request May 7, 2020

Persistent queries on tables should require key columns #5303

Closed

big-andy-coates merged commit bd9302a into confluentinc:master May 7, 2020

big-andy-coates deleted the klip-24-key-column-query-semantics branch May 7, 2020 16:01

mjsax approved these changes May 7, 2020

S-makes reviewed Mar 29, 2021
