[Auto repartitioning support]: Auto repartition joins on key format mismatch #6229

Closed
big-andy-coates opened this issue Sep 16, 2020 · 6 comments · Fixed by #6708

@big-andy-coates
Contributor

Auto-repartitioning on key format mismatch. Adds support for automatic repartitioning of streams and tables for joins where key formats do not match.

  1. Update the logical plan so that we introduce repartition steps on (right) sources as needed to ensure correct key format.
    • Make use of existing repartitions where possible.
  2. Update docs to cover:
    • the implicit / automatic repartitioning.
    • that the right source is the one repartitioned, and why (see KLIP-33). For stream-stream and table-table joins, users can control which side is repartitioned by swapping the sides of the join.
    • an example of how to manually repartition a source to avoid implicit repartitioning, e.g. a stream-table join, where the order cannot be flipped.
  3. Lots of QTT tests!
  4. Doc updates.
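
As a rough illustration of the manual-repartitioning example mentioned in item 2, the sketch below re-keys the stream side of a stream-table join so that its key format already matches the table's, leaving nothing for an implicit repartition of the table to fix. The source definitions, topic names, and the name S_JSON_KEY are hypothetical, and it assumes the KEY_FORMAT property proposed in KLIP-33 is available in the ksqlDB version in use.

```sql
-- Hypothetical sources: S has a KAFKA-format key, T has a JSON-format key.
CREATE STREAM S (ID INT KEY, VAL STRING)
  WITH (KAFKA_TOPIC='s', KEY_FORMAT='KAFKA', VALUE_FORMAT='JSON', PARTITIONS=1);

CREATE TABLE T (ID INT PRIMARY KEY, VAL STRING)
  WITH (KAFKA_TOPIC='t', KEY_FORMAT='JSON', VALUE_FORMAT='JSON', PARTITIONS=1);

-- Manually rewrite the stream with a JSON key (hypothetical name S_JSON_KEY).
-- The keys are re-serialized into a new topic, so they are also re-partitioned.
CREATE STREAM S_JSON_KEY WITH (KEY_FORMAT='JSON') AS
  SELECT * FROM S;

-- Both sides now share a key format, so the join needs no implicit repartition of T.
CREATE STREAM S2 AS
  SELECT S_JSON_KEY.ID, T.VAL
  FROM S_JSON_KEY JOIN T ON S_JSON_KEY.ID = T.ID;
```

The trade-off is an extra persistent query and topic for S_JSON_KEY, in exchange for control over which side pays the repartitioning cost.
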
@big-andy-coates
Contributor Author

big-andy-coates commented Sep 30, 2020

Need to think, long term, how do we compare format properties?

DELIMITED's delimiter property matters. A two-column key with a , separator isn't going to join to a two-column key with a | separator.

We probably want to auto-repartition if the properties mismatch, not just the format.

@big-andy-coates
Contributor Author

Important thing to test: if we're repartitioning a table, then we should ensure we create only a single state store and a single changelog topic.

e.g.

CREATE STREAM S2 AS 
   SELECT S.ID, T.VAL FROM S JOIN T ON S.ID = T.ID;

Given T is a table, we'd normally create a _confluent-ksql-some.ksql.service.idquery_CSAS_S2_0-KafkaTopic_Right-Reduce-changelog to reduce the source topic into a table.

If T is being repartitioned because its key format is wrong, then we should ensure we don't add a second changelog to handle the change of key format. Instead we should either:

  • avoid creating the initial changelog and only write a new changelog with the new format, or
  • write the first changelog with the correct, updated, key format. (Not sure this is possible).

The overhead of two changelog topics could be prohibitive for many.

@big-andy-coates
Contributor Author

FYI PR #6393 goes some way to auto-repartitioning streams where key formats differ. However, it does not do all that is covered by KLIP-33. Specifically, it does not repartition tables.

@big-andy-coates
Contributor Author

@agavra I see you've moved this to in review - but I don't see a PR linked...

@agavra
Contributor

agavra commented Oct 19, 2020

@big-andy-coates oops, I misread this ticket. I assumed it was referring to what #6393 addresses. I'll move it back to the todo column

@vcrfxia
Contributor

vcrfxia commented Nov 20, 2020

#6635 added support for repartitioning tables, so the enhancement described in this ticket is not far off. However, the table repartitioning added in #6635 results in two state stores (and therefore two changelog topics) per table, rather than one. We should probably fix that before implementing the enhancement in this ticket (see #6650).
