docs: klip-33 key format #6017
LGTM!
Kafka micro site examples to leverage the new functionality, as these have automated testing.
It may be worth changing the ksqlDB quickstart too - TBD, as this will require extending DataGen
to support other key formats. Something we may want in scope anyway - or should we end-of-life
DataGen in favour of the datagen connector?
+1
Please don't eol the datagen tool!
It's far too useful for anyone who wants to test or demo something quickly. And before someone says "you can just run an embedded datagen connector though so what's the difference?" :-), allow me to opine that this requires a steeper learning curve and advanced troubleshooting skills to get working right, which I think should be avoided where possible. Just an opinion of course ;)
I'd also say that the connector should be refactored (as I recall suggesting loudly when it was originally forked off from here) so that it embeds some reusable portion of the datagen tool (which, to be fair, likely requires a small refactor of datagen itself too) rather than being a copy/paste that now proceeds on its own life journey and inevitable divergence.
@blueedgenick are you putting a case forward that DataGen should be enhanced to support Avro / JSON keys as part of this work, i.e. that it should be in scope?
LGTM! Thanks @big-andy-coates
@agavra @vcrfxia I've added you back in for a review as there have been some changes, and I've found some more edge cases that need more thought. See outstanding questions in the description at the top of this page, and also would be great if you could look at and think about the implementation section where it talks about deciding which side to repartition if key formats don't match.
tl;dr I think the KLIP is good whichever way you decide to go on the outstanding issues. I have some thoughts below if you need a tie-breaker on what your intuition tells you.
IMO, deprecating DataGen is scope creep that we should avoid if possible. If a user really wants to generate keys in other formats, I'm happy just leaving it publishing only KAFKA keys and then asking them to manually
🤷 (responding to this and impl section on choosing key) Whichever decision we make will upset some people and be sub-optimal for their use cases. I think it might make sense to default to repartitioning the table, and then do what @mjsax suggested and repartition the stream instead if it's getting repartitioned anyway. I wouldn't over-index on this because it's likely that organizations use the same data format throughout, so I suspect that this scenario will be somewhat rare.
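For anyone skimming the thread, here is a rough sketch of why mismatched key formats force a repartition at all. It assumes the KAFKA format writes an INT key as a 4-byte big-endian integer and that JSON writes it as text; `partition_for` uses CRC32 purely as a stand-in for Kafka's real partitioner (which uses murmur2), and all function names are made up for illustration.

```python
import json
import struct
import zlib


def serialize_kafka_int(key: int) -> bytes:
    # Assumed KAFKA-format encoding of an INT key: 4 bytes, big-endian.
    return struct.pack(">i", key)


def serialize_json(key: int) -> bytes:
    # JSON encoding of the same logical key: its decimal text.
    return json.dumps(key).encode("utf-8")


def partition_for(key_bytes: bytes, num_partitions: int) -> int:
    # Stand-in hash only; Kafka's default partitioner uses murmur2, not CRC32.
    return zlib.crc32(key_bytes) % num_partitions


def repartition(records, serialize, num_partitions):
    # Re-serialize each key with the chosen format and route it by the new bytes.
    out = {p: [] for p in range(num_partitions)}
    for key, value in records:
        out[partition_for(serialize(key), num_partitions)].append((key, value))
    return out


# The same logical key yields different bytes under the two formats, so the
# two sides of a join are not guaranteed to agree on the partition for a key.
kafka_bytes = serialize_kafka_int(42)
json_bytes = serialize_json(42)

# Once one side is rewritten with the other side's key format, every key
# lands in the same partition on both sides again.
left = repartition([(k, "left") for k in range(10)], serialize_kafka_int, 4)
right = repartition([(k, "right") for k in range(10)], serialize_kafka_int, 4)
```

Which side gets rewritten is exactly the policy question being debated here; the sketch only shows that one of them has to be.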
🤔 that's an interesting scenario. I think it makes sense to force keyless streams to be
Good point. We could just NOT support it... For keyless streams, I've added a proposal to the KLIP: https://github.com/confluentinc/ksql/blob/4ca0788edb5e72537890bdb1c89628e039680320/design-proposals/klip-33-key-format.md#schema-inference. Just thinking out loud really... The
+1 to this. I don't think we should invest the effort to enhance datagen to support multiple key formats (and types) unless we receive user requests to do so. I'm not aware of user requests to enhance ksql-datagen to support non-string keys even though ksqlDB supports those (in Kafka format), so I'm not sure why support for different key formats would be different.
I'm not a fan of forcing keyless streams to have key format KAFKA, feels rather arbitrary. I like the idea of either using Then again, maybe
@big-andy-coates I remember reading a discussion that different versions of the same Avro schema are byte incompatible. May be worth adding compatibility implications of schema evolution to this KLIP (especially for joins).
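A small sketch of one way that byte incompatibility can show up, assuming keys are written in the Schema Registry wire format (a zero magic byte, then a 4-byte big-endian schema id, then the Avro binary payload). The two schemas and their ids are hypothetical, and the encoder is hand-rolled for illustration rather than taken from an Avro library.

```python
import struct


def zigzag_varint(n: int) -> bytes:
    # Avro's variable-length zig-zag encoding for int/long values.
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def avro_string(s: str) -> bytes:
    # Avro strings are a zig-zag length followed by the UTF-8 bytes.
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def confluent_frame(schema_id: int, payload: bytes) -> bytes:
    # Magic byte 0, then the schema id as a 4-byte big-endian integer.
    return b"\x00" + struct.pack(">I", schema_id) + payload


# Hypothetical v1 key schema: record {id: long}, registered under id 1.
key_v1 = confluent_frame(1, zigzag_varint(7))

# Hypothetical v2 key schema: record {id: long, region: string, default ""},
# registered under id 2. The logical key is still "id 7".
key_v2 = confluent_frame(2, zigzag_varint(7) + avro_string(""))
```

Even with an unchanged `id`, the two serialized keys differ in both the embedded schema id and the payload, so under these assumptions they would hash to unrelated partitions and never compare equal as bytes.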
Do we really need to introduce
@big-andy-coates Ah, I misread, my fault. I still think the server config is overkill, but I don't feel that strongly about that.
@big-andy-coates not relevant for this KLIP, but would it make sense to create a new schema registry compatibility level that does not allow evolution? Otherwise another application external to ksqlDB could try to evolve the key schema and break things unintentionally.
@PeterLindner, yeah, it may be possible to have a compatibility level that just says "don't allow evolution", which would be useful for key schemas. confluentinc/schema-registry#1610
Force-pushed from f8c6721 to fb32216.
- Switch NULL -> NONE format.
- Switch JOINs to repartition the right source.
- Add more details to NONE format.
@MichaelDrogalis @derekjn @colinhicks engineers have now approved this. Can I get product approval please?
LGTM!
Skimmed what seemed new, still LGTM
LGTM. Thanks for the thorough and thoughtful discussion here, @big-andy-coates.
I left a few nits as suggestions for readability below.
Latest updates LGTM!
see out-of-order records, though per-key ordering would be maintained. Thus time-tracking
("stream-time"), grace-period and retention-time might be affected. However, this phenomenon
already exists, and is deemed acceptable, for other implicit re-partitions.
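A toy illustration of the ordering point in the excerpt above, not a model of how ksqlDB or Kafka Streams actually schedules work. It assumes both keys end up in the same target partition and that the repartition step happens to drain one source partition before the other, which is just one possible interleaving; the records and helper names are invented.

```python
from collections import defaultdict

# Two hypothetical source partitions; each record is (key, timestamp).
source_partitions = [
    [("a", 1), ("a", 3)],
    [("b", 2), ("b", 4)],
]


def repartition_sequentially(partitions):
    # Drain each source partition in turn into a single target partition.
    target = []
    for partition in partitions:
        target.extend(partition)
    return target


def is_sorted(values):
    return all(x <= y for x, y in zip(values, values[1:]))


target = repartition_sequentially(source_partitions)

# Timestamps as seen by a consumer of the target partition.
timestamps = [ts for _, ts in target]

# Timestamps grouped by key, in arrival order.
per_key = defaultdict(list)
for key, ts in target:
    per_key[key].append(ts)

print(timestamps)     # overall order is no longer monotonic
print(dict(per_key))  # each key's own records are still in order
```

Because all records for a given key come from a single source partition, their relative order survives the rewrite; only the interleaving across keys changes, which is what can disturb stream-time, grace periods and retention.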
Add note clarifying that users can always repartition topics themselves before the join, in order to have full control over which sources are repartitioned (and that choosing the same sources ksqlDB would repartition and performing the repartitions upfront is equivalent from a resource-usage standpoint)? Or if not here, at least in the docs section so we don't forget to add the note later.
Co-authored-by: Victoria Xia <[email protected]>
Co-authored-by: Colin Hicks <[email protected]>
KLIP adding KEY_FORMAT and other things to the language.