Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: klip-33 key format #6017

Merged

Conversation

big-andy-coates
Copy link
Contributor

@big-andy-coates big-andy-coates commented Aug 13, 2020

KLIP adding KEY_FORMAT and other things to the language.

@big-andy-coates big-andy-coates requested a review from a team as a code owner August 13, 2020 18:02
Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

design-proposals/klip-33-key-format.md Show resolved Hide resolved
design-proposals/klip-33-key-format.md Outdated Show resolved Hide resolved
design-proposals/klip-33-key-format.md Show resolved Hide resolved
design-proposals/klip-33-key-format.md Outdated Show resolved Hide resolved
design-proposals/klip-33-key-format.md Outdated Show resolved Hide resolved
Kafka micro site examples to leverage the new functionality, as these have automated testing.
It may be worth changing the ksqlDB quickstart too - TBD, as this will require extending DataGen
to support other key formats. Something we may want in scope anyway - or should be end-of-life
DataGen in favour of the datagen connector?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't eol the datagen tool!
It's far too useful for anyone who wants to test or demo something quickly. And before someone says "you can just run an embedded datagen connector though so what's the difference?" :-), allow me to opine that this requires a steeper learning curve and advanced troubleshooting skills to get working right, which I think should be avoided where possible. Just an opinion of course ;)
I'd also say that the connector should be refactored (as i recall suggesting loudly when it was originally forked off from here) so that it embeds some re-usable portion of the datagen tool (which likely requires a small refactor of datagen itself too, to facilitate, to be fair) rather than being a copy/paste that now proceeds on its own life journey and inevitable divergence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blueedgenick are you putting a case forward that DataGen should be enhanced to support Avro / Json keys as part of this work, i.e. it should be in-scaope?

design-proposals/klip-33-key-format.md Show resolved Hide resolved
Copy link
Contributor

@vcrfxia vcrfxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @big-andy-coates

design-proposals/klip-33-key-format.md Outdated Show resolved Hide resolved
design-proposals/klip-33-key-format.md Outdated Show resolved Hide resolved
design-proposals/klip-33-key-format.md Show resolved Hide resolved
design-proposals/klip-33-key-format.md Outdated Show resolved Hide resolved
design-proposals/klip-33-key-format.md Outdated Show resolved Hide resolved
@big-andy-coates
Copy link
Contributor Author

big-andy-coates commented Aug 20, 2020

@agavra @vcrfxia I've added you back in for a review as there have been some changes, and I've found some more edge cases that need more thought.

See outstanding questions in the description at the top of this page, and also would be great if you could look at and think about the implementation section where it talks about deciding which side to repartition if key formats don't match.

@agavra
Copy link
Contributor

agavra commented Aug 20, 2020

tl;dr I think the KLIP is good whichever way you decide to go on the outstanding issues. I have some thoughts below if you need some tie-breaker on what you're intuition tells you.


* DataGen: Do we end-of-life our DataGen in favour of the datagen connector or update DataGen to support other key formats?

IMO, deprecating DataGen is scope creep that we should avoid if possible. If a user really wants to generate keys in other formats, I'm happy just leaving it publishing only KAFKA keys and then asking them to manually CREATE STREAM bar WITH(key_format='AVRO') AS SELECT * FROM foo . This is probably a product decision though, but I don't think we should block the KLIP on it.

* Cost based optimiser: Do we price the cost of repartitioning tables slightly less than streams? Based on the BIG assumption that streams, on average, see higher throughput than tables.

🤷 (responding to this and impl section on choosing key) Whichever decision we make will upset some people and be sub-optimal for their use cases. I think it might make sense to default to repartition the table, and then do what @mjsax suggested to repartition the stream if it's getting re-partitioned anyway. I wouldn't over-index on this because it's likely that organizations use the same data format throughout, so I suspect that this scenario will be somewhat rare.

* How to handle key-less streams? If the default key format is one that supports the schema registry, should we / can we register an empty schema? If not, how can we differentiate a missing schema, i.e. an error, from a key-less stream?

🤔 that's an interesting scenario. I think it makes sense to force keyless streams to be KAFKA format (or, we could create an alias NULL if we want to make it easier to read) and throw an error if they specify a different key format and no key columns with something like key format 'AVRO' does not support empty keys, either use key_format='NULL' or specify a (PRIMARY) KEY column

@big-andy-coates
Copy link
Contributor Author

big-andy-coates commented Aug 20, 2020

I wouldn't over-index on this because it's likely that organizations use the same data format throughout, so I suspect that this scenario will be somewhat rare.

Good point. We could just NOT support it...

For keyless streams, I've added a proposal to the KLIP:https://github.com/confluentinc/ksql/blob/4ca0788edb5e72537890bdb1c89628e039680320/design-proposals/klip-33-key-format.md#schema-inference. Just thinking out loud really...

The NULL key_format is probably an easier solution though! ;)

@big-andy-coates big-andy-coates added the design-proposal Tag KLIP Prs with this label label Aug 20, 2020
@vcrfxia
Copy link
Contributor

vcrfxia commented Aug 20, 2020

IMO, deprecating DataGen is scope creep that we should avoid if possible. If a user really wants to generate keys in other formats, I'm happy just leaving it publishing only KAFKA keys and then asking them to manually CREATE STREAM bar WITH(key_format='AVRO') AS SELECT * FROM foo .

+1 to this. I don't think we should invest the effort to enhance datagen to support multiple key formats (and types) unless we receive user requests to do so. I'm not aware of user requests to enhance ksql-datagen to support non-string keys even though ksqlDB supports those (in Kafka format), so I'm not sure why support for different key formats would be different.

I think it makes sense to force keyless streams to be KAFKA format (or, we could create an alias NULL if we want to make it easier to read) and throw an error if they specify a different key format and no key columns with something like key format 'AVRO' does not support empty keys, either use key_format='NULL' or specify a (PRIMARY) KEY column.

I'm not a fan of forcing keyless streams to have key format KAFKA, feels rather arbitrary. I like the idea of either using key_format='NULL' or introducing syntax such as NULL KEY (where NULL is the type and no column name is provided). Naively I prefer the latter so the declaration occurs in the same place columns are typically defined, but I'm no SQL guru so perhaps I've proposed something blasphemous.

Then again, maybe key_format='NULL' is preferred since it extends more naturally to supporting empty value columns, assuming that's something we'd like to do in the future, since the NULL KEY option would require NULL VALUE to parallel it, which would be new syntax (since we don't have VALUE today).

@PeterLindner
Copy link

@big-andy-coates I remember reading a discussion that different versions of the same Avro schema are byte incompatible. May be worth adding compatibility implications of schema evolution to this Klip (especially for joins)

@MichaelDrogalis
Copy link
Contributor

MichaelDrogalis commented Aug 20, 2020

Do we really need to introduce ksql.persistence.default.format.key? I have two concerns:

  1. If I understand right, unless someone configures their server with this property, relaunching queries that worked in a previous version will no longer work. I get why we need to have something like this, but could we not just make KEY_FORMAT default to KAFKA and deprecate it being optional a few releases down the road? We're dinging people with a lot of breaking changes in the last few releases—this one seems avoidable.

  2. Similar to KLIP-34, the value of these types of server configs isn't apparent. I'm not sure anyone to date has asked for something like that.

@MichaelDrogalis
Copy link
Contributor

@big-andy-coates Ah, I misread—my fault. I still think the server config is overkill, but I don't feel that strongly about that.

@PeterLindner
Copy link

@big-andy-coates not relevant for this Klip, but would it make sense to create a new schema registry compatibility level that does not allow evolution? Otherwise another application external to ksqldb could try to evolve the key schema and break things unitentionally

@big-andy-coates
Copy link
Contributor Author

@big-andy-coates not relevant for this Klip, but would it make sense to create a new schema registry compatibility level that does not allow evolution? Otherwise another application external to ksqldb could try to evolve the key schema and break things unitentionally

@PeterLindner, yeah, it may be possible to have a compatibility level that just says "don't allow evolution", which would be useful for key schemas. confluentinc/schema-registry#1610

- Switch NULL -> NONE format.
- Switch JOINs to repartition the right source
- Add more details to NONE format
@big-andy-coates
Copy link
Contributor Author

@MichaelDrogalis @derekjn @colinhicks engineers have now approved this. Can I get product approval please?

Copy link
Member

@JimGalasyn JimGalasyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skimmed what seemed new, still LGTM

Copy link
Contributor

@colinhicks colinhicks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the thorough and thoughtful discussion here, @big-andy-coates.

I left a few nits as suggestions for readability below.

design-proposals/klip-33-key-format.md Outdated Show resolved Hide resolved
design-proposals/klip-33-key-format.md Outdated Show resolved Hide resolved
design-proposals/klip-33-key-format.md Outdated Show resolved Hide resolved
design-proposals/klip-33-key-format.md Outdated Show resolved Hide resolved
design-proposals/klip-33-key-format.md Outdated Show resolved Hide resolved
design-proposals/klip-33-key-format.md Outdated Show resolved Hide resolved
design-proposals/klip-33-key-format.md Outdated Show resolved Hide resolved
design-proposals/klip-33-key-format.md Show resolved Hide resolved
Copy link
Contributor

@vcrfxia vcrfxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest updates LGTM!

design-proposals/klip-33-key-format.md Outdated Show resolved Hide resolved
see out-of-order records, though per-key ordering would be maintained. Thus time-tracking
("stream-time"), grace-period and retention-time might be affected. However, this phenomenon
already exists, and is deemed acceptable, for other implicit re-partitions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add note clarifying that users can always repartition topics themselves before the join, in order to have full control over which sources are repartitioned (and that choosing the same sources ksqlDB would repartition and performing the repartitions upfront is equivalent from a resource-usage standpoint)? Or if not here, at least in the docs section so we don't forget to add the note later.

design-proposals/klip-33-key-format.md Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design-proposal Tag KLIP Prs with this label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants