docs: klip-33 key format #6017

big-andy-coates · 2020-08-13T18:02:18Z

KLIP adding KEY_FORMAT and other things to the language.

agavra

LGTM!

agavra · 2020-08-13T19:42:50Z

design-proposals/klip-33-key-format.md

+Kafka micro site examples to leverage the new functionality, as these have automated testing.  
+It may be worth changing the ksqlDB quickstart too - TBD, as this will require extending DataGen 
+to support other key formats. Something we may want in scope anyway - or should be end-of-life 
+DataGen in favour of the datagen connector?


blueedgenick · 2020-08-20

Please don't EOL the datagen tool!
It's far too useful for anyone who wants to test or demo something quickly. And before someone says "you can just run an embedded datagen connector though so what's the difference?" :-), allow me to opine that this requires a steeper learning curve and advanced troubleshooting skills to get working right, which I think should be avoided where possible. Just an opinion of course ;)
I'd also say that the connector should be refactored (as I recall suggesting loudly when it was originally forked off from here) so that it embeds some reusable portion of the datagen tool (which, to be fair, likely requires a small refactor of datagen itself too) rather than being a copy/paste that now proceeds on its own life journey and inevitable divergence.

big-andy-coates · 2020-09-07

@blueedgenick are you putting a case forward that DataGen should be enhanced to support Avro / JSON keys as part of this work, i.e. that it should be in scope?

vcrfxia

LGTM! Thanks @big-andy-coates

big-andy-coates · 2020-08-20T15:54:58Z

@agavra @vcrfxia I've added you back in for a review as there have been some changes, and I've found some more edge cases that need more thought.

See outstanding questions in the description at the top of this page, and also would be great if you could look at and think about the implementation section where it talks about deciding which side to repartition if key formats don't match.

agavra · 2020-08-20T16:16:59Z

tl;dr I think the KLIP is good whichever way you decide to go on the outstanding issues. I have some thoughts below if you need a tie-breaker on what your intuition tells you.

* DataGen: Do we end-of-life our DataGen in favour of the datagen connector or update DataGen to support other key formats?

IMO, deprecating DataGen is scope creep that we should avoid if possible. If a user really wants to generate keys in other formats, I'm happy just leaving it publishing only KAFKA keys and then asking them to manually CREATE STREAM bar WITH(key_format='AVRO') AS SELECT * FROM foo. This is probably a product decision, but I don't think we should block the KLIP on it.

* Cost based optimiser: Do we price the cost of repartitioning tables slightly less than streams? Based on the BIG assumption that streams, on average, see higher throughput than tables.

🤷 (responding to this and impl section on choosing key) Whichever decision we make will upset some people and be sub-optimal for their use cases. I think it might make sense to default to repartition the table, and then do what @mjsax suggested to repartition the stream if it's getting re-partitioned anyway. I wouldn't over-index on this because it's likely that organizations use the same data format throughout, so I suspect that this scenario will be somewhat rare.

* How to handle key-less streams? If the default key format is one that supports the schema registry, should we / can we register an empty schema? If not, how can we differentiate a missing schema, i.e. an error, from a key-less stream?

🤔 that's an interesting scenario. I think it makes sense to force keyless streams to be KAFKA format (or we could create an alias NULL if we want to make it easier to read) and throw an error if they specify a different key format and no key columns, with something like: "key format 'AVRO' does not support empty keys, either use key_format='NULL' or specify a (PRIMARY) KEY column".
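
A minimal sketch of the keyless-stream check proposed in this comment, written in Python purely for illustration. It is not ksqlDB code: `validate_key_format`, `KeyFormatError` and `KEYLESS_FORMATS` are hypothetical names, and the rule encoded is only the suggestion made here (keyless streams must use KAFKA or a NULL alias), not a final design.

```python
from typing import Sequence

# Assumption: the formats a keyless stream may declare, per the suggestion
# in this comment (KAFKA, or a 'NULL' alias). Hypothetical, not ksqlDB's API.
KEYLESS_FORMATS = {"KAFKA", "NULL"}


class KeyFormatError(ValueError):
    """Raised when a key format is declared but no key column exists."""


def validate_key_format(key_format: str, key_columns: Sequence[str]) -> None:
    """Reject a non-keyless key format on a source with no key columns."""
    fmt = key_format.upper()
    if not key_columns and fmt not in KEYLESS_FORMATS:
        raise KeyFormatError(
            f"key format '{fmt}' does not support empty keys, either use "
            "key_format='NULL' or specify a (PRIMARY) KEY column"
        )


# Usage: a keyed AVRO stream and a keyless KAFKA stream both pass,
# while a keyless AVRO stream is rejected.
validate_key_format("AVRO", ["id"])
validate_key_format("KAFKA", [])
```

The point of failing fast is that a missing key schema can then be treated as an error rather than being confused with an intentionally keyless stream.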

big-andy-coates · 2020-08-20T16:24:54Z

> I wouldn't over-index on this because it's likely that organizations use the same data format throughout, so I suspect that this scenario will be somewhat rare.

Good point. We could just NOT support it...

For keyless streams, I've added a proposal to the KLIP: https://github.com/confluentinc/ksql/blob/4ca0788edb5e72537890bdb1c89628e039680320/design-proposals/klip-33-key-format.md#schema-inference. Just thinking out loud really...

The NULL key_format is probably an easier solution though! ;)

vcrfxia · 2020-08-20T17:48:18Z

> IMO, deprecating DataGen is scope creep that we should avoid if possible. If a user really wants to generate keys in other formats, I'm happy just leaving it publishing only KAFKA keys and then asking them to manually CREATE STREAM bar WITH(key_format='AVRO') AS SELECT * FROM foo.

+1 to this. I don't think we should invest the effort to enhance datagen to support multiple key formats (and types) unless we receive user requests to do so. I'm not aware of user requests to enhance ksql-datagen to support non-string keys even though ksqlDB supports those (in Kafka format), so I'm not sure why support for different key formats would be different.

> I think it makes sense to force keyless streams to be KAFKA format (or, we could create an alias NULL if we want to make it easier to read) and throw an error if they specify a different key format and no key columns with something like "key format 'AVRO' does not support empty keys, either use key_format='NULL' or specify a (PRIMARY) KEY column".

I'm not a fan of forcing keyless streams to have key format KAFKA, feels rather arbitrary. I like the idea of either using key_format='NULL' or introducing syntax such as NULL KEY (where NULL is the type and no column name is provided). Naively I prefer the latter so the declaration occurs in the same place columns are typically defined, but I'm no SQL guru so perhaps I've proposed something blasphemous.

Then again, maybe key_format='NULL' is preferred since it extends more naturally to supporting empty value columns, assuming that's something we'd like to do in the future, since the NULL KEY option would require NULL VALUE to parallel it, which would be new syntax (since we don't have VALUE today).

PeterLindner · 2020-08-20T20:49:22Z

@big-andy-coates I remember reading a discussion that different versions of the same Avro schema are byte incompatible. It may be worth adding the compatibility implications of schema evolution to this KLIP (especially for joins).

MichaelDrogalis · 2020-08-20T22:08:48Z

Do we really need to introduce ksql.persistence.default.format.key? I have two concerns:

1. If I understand right, unless someone configures their server with this property, relaunching queries that worked in a previous version will no longer work. I get why we need to have something like this, but could we not just make KEY_FORMAT default to KAFKA and deprecate it being optional a few releases down the road? We're dinging people with a lot of breaking changes in the last few releases; this one seems avoidable.
2. Similar to KLIP-34, the value of these types of server configs isn't apparent. I'm not sure anyone to date has asked for something like that.

MichaelDrogalis · 2020-08-21T15:01:15Z

@big-andy-coates Ah, I misread—my fault. I still think the server config is overkill, but I don't feel that strongly about that.

PeterLindner · 2020-08-26T05:37:59Z

@big-andy-coates not relevant for this KLIP, but would it make sense to create a new Schema Registry compatibility level that does not allow evolution? Otherwise another application external to ksqlDB could try to evolve the key schema and break things unintentionally.

big-andy-coates · 2020-09-01T15:13:36Z

> @big-andy-coates not relevant for this KLIP, but would it make sense to create a new Schema Registry compatibility level that does not allow evolution? Otherwise another application external to ksqlDB could try to evolve the key schema and break things unintentionally.

@PeterLindner, yeah, it may be possible to have a compatibility level that just says "don't allow evolution", which would be useful for key schemas. confluentinc/schema-registry#1610

- Switch NULL -> NONE format.
- Switch JOINs to repartition the right source
- Add more details to NONE format

big-andy-coates · 2020-09-07T15:32:25Z

@MichaelDrogalis @derekjn @colinhicks engineers have now approved this. Can I get product approval please?

JimGalasyn

LGTM!

agavra

Skimmed what seemed new, still LGTM

colinhicks

LGTM. Thanks for the thorough and thoughtful discussion here, @big-andy-coates.

I left a few nits as suggestions for readability below.

vcrfxia

Latest updates LGTM!

vcrfxia · 2020-09-10T18:26:10Z

design-proposals/klip-33-key-format.md

+see out-of-order records, though per-key ordering would be maintained. Thus time-tracking 
+("stream-time"), grace-period and retention-time might be affected. However, this  phenomenon 
+already exists, and is deemed acceptable, for other implicit re-partitions.
+


Add note clarifying that users can always repartition topics themselves before the join, in order to have full control over which sources are repartitioned (and that choosing the same sources ksqlDB would repartition and performing the repartitions upfront is equivalent from a resource-usage standpoint)? Or if not here, at least in the docs section so we don't forget to add the note later.
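
To make the trade-off concrete, here is a minimal Python sketch of the implicit-repartition decision. It assumes the rule from this PR's "chore: updates" commit message ("Switch JOINs to repartition the right source") and that repartitioning re-serializes the right source's key into the left source's key format. `side_to_repartition` is a hypothetical helper for illustration, not part of ksqlDB's planner.

```python
from typing import Optional


def side_to_repartition(left_key_format: str, right_key_format: str) -> Optional[str]:
    """Return which join source would be implicitly repartitioned, if any.

    Assumed rule: on a key format mismatch the right source is repartitioned.
    """
    if left_key_format.upper() == right_key_format.upper():
        # Key formats already match, so no format-driven repartition is needed.
        return None
    return "right"


# Usage: a mismatch repartitions the right source; if the user has already
# repartitioned one side so that the formats match, nothing implicit happens.
print(side_to_repartition("KAFKA", "AVRO"))
print(side_to_repartition("AVRO", "AVRO"))
```

Under this assumption, a user who wants a different source repartitioned can create a correctly keyed copy of it upfront, so that the formats already match when the join is planned.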

docs: klip-33 key format

5d07d45

big-andy-coates requested a review from a team as a code owner August 13, 2020 18:02

agavra approved these changes Aug 13, 2020

vcrfxia approved these changes Aug 13, 2020

chore: feedback

9a47d31

mjsax reviewed Aug 16, 2020 (9 reviews, each with one resolved comment thread on design-proposals/klip-33-key-format.md; 4 of the threads are marked outdated)

chore: suggestions

dcd28ff

big-andy-coates requested review from agavra and vcrfxia August 20, 2020 15:54

chore: remove outstanding

4ca0788

big-andy-coates requested review from mjsax, a team, MichaelDrogalis, derekjn and colinhicks August 20, 2020 16:06

big-andy-coates added the design-proposal label Aug 20, 2020

vpapavas reviewed Aug 27, 2020

big-andy-coates commented Sep 2, 2020

big-andy-coates requested a review from JimGalasyn as a code owner September 7, 2020 13:36

big-andy-coates force-pushed the klip-33-key-formats branch from f8c6721 to fb32216 Compare September 7, 2020 13:38

chore: updates

04c5fb5

- Switch NULL -> NONE format.
- Switch JOINs to repartition the right source
- Add more details to NONE format

big-andy-coates mentioned this pull request Sep 8, 2020

docs: add KLIP-34: Optional WITH #6065

Closed

JimGalasyn approved these changes Sep 8, 2020

agavra approved these changes Sep 9, 2020

Add LOE

1e036bf

colinhicks approved these changes Sep 10, 2020

vcrfxia approved these changes Sep 10, 2020

big-andy-coates and others added 10 commits September 10, 2020 20:37

Update design-proposals/klip-33-key-format.md

1b3cbac

Co-authored-by: Victoria Xia

Update design-proposals/klip-33-key-format.md

2cb5af1

Co-authored-by: Colin Hicks

Update design-proposals/klip-33-key-format.md

34e8790

Co-authored-by: Colin Hicks

Update design-proposals/klip-33-key-format.md

558cd3c

Co-authored-by: Colin Hicks

Update design-proposals/klip-33-key-format.md

c5f6e8f

Co-authored-by: Colin Hicks

Update design-proposals/klip-33-key-format.md

9f5524f

Co-authored-by: Colin Hicks

Update design-proposals/klip-33-key-format.md

1eaa913

Co-authored-by: Colin Hicks

Update design-proposals/klip-33-key-format.md

591985a

Co-authored-by: Colin Hicks

Update design-proposals/klip-33-key-format.md

8d733f3

Co-authored-by: Victoria Xia

Update klip-33-key-format.md

862f1fd

big-andy-coates merged commit 0f3a8e8 into confluentinc:master Sep 11, 2020

big-andy-coates deleted the klip-33-key-formats branch September 11, 2020 10:56

This was referenced Dec 2, 2020

chore: set key schema name to avoid conflicts #6609

Closed

feat: support table joins on key format mismatch agavra/ksql#2

Closed

feat: support table joins on key format mismatch #6708

Merged
