Switch to a binary format for internal record keys #724

Closed
rodesai opened this issue Feb 13, 2018 · 4 comments

Comments

@rodesai
Contributor

rodesai commented Feb 13, 2018

When processing a query with a GROUP BY clause containing multiple columns, KSQL generates a new key for each record so that streams can repartition and aggregate according to the GROUP BY clause. The new key is the values of the GROUP BY columns concatenated with the separator string "|+|". If the values themselves contain this separator, the resulting key may be ambiguous.

For example, consider the following query:

CREATE TABLE AS SELECT A, B, count(*) FROM STREAM FOO GROUP BY A, B

And the following two records:

{..."A":"foo|+|bar", "B":"baz"...}
{..."A":"foo", "B":"bar|+|baz"...}

Both records will take the key "foo|+|bar|+|baz", and the result of the aggregate will be incorrect.
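
The collision can be reproduced outside KSQL with a minimal sketch. This is not KSQL's actual code; `build_key` is a hypothetical helper that only assumes the key is built by joining the column values with the "|+|" separator, as described above.

```python
SEPARATOR = "|+|"  # separator string described in this issue


def build_key(values):
    """Join GROUP BY column values into a single string key.

    Hypothetical helper for illustration; not taken from the KSQL code base.
    """
    return SEPARATOR.join(values)


# The two records from the example above: (A, B) column values.
key_1 = build_key(["foo|+|bar", "baz"])
key_2 = build_key(["foo", "bar|+|baz"])

# Distinct (A, B) pairs map to the same key, so they land in the same group.
assert key_1 == key_2 == "foo|+|bar|+|baz"
```

Because the joined string does not record where one value ends and the next begins, no consumer of the key can recover the original column values.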

@rodesai rodesai added the bug label Feb 13, 2018
@apurvam apurvam changed the title from "Possibility of ambiguous record key after GROUP BY with multiple columns" to "Possibility of ambiguous record key after GROUP BY with multiple columns could result in wrong aggregations" Feb 13, 2018
@rodesai rodesai changed the title from "Possibility of ambiguous record key after GROUP BY with multiple columns could result in wrong aggregations" to "Switch to a binary format for internal record keys" Mar 8, 2018
@rodesai
Contributor Author

rodesai commented Mar 8, 2018

Fixing this will require using some sort of binary format for internal record keys. Some options:
- fields with a size prefix
- avro
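
As a rough illustration of the first option, here is a sketch of a size-prefixed encoding. It rests on assumptions that are not part of this issue: a 4-byte big-endian length prefix, UTF-8 string values, and the hypothetical function names `encode_key` and `decode_key`. It is not a proposal for the actual wire format.

```python
import struct


def encode_key(values):
    """Encode each value as a 4-byte big-endian length followed by UTF-8 bytes.

    The prefix width and byte order are assumptions made for this sketch.
    """
    out = bytearray()
    for value in values:
        data = value.encode("utf-8")
        out += struct.pack(">I", len(data))
        out += data
    return bytes(out)


def decode_key(key):
    """Invert encode_key by reading a length, then that many bytes, repeatedly."""
    values = []
    pos = 0
    while pos < len(key):
        (size,) = struct.unpack_from(">I", key, pos)
        pos += 4
        values.append(key[pos:pos + size].decode("utf-8"))
        pos += size
    return values


# The two records that collide under the "|+|" scheme now get distinct keys.
assert encode_key(["foo|+|bar", "baz"]) != encode_key(["foo", "bar|+|baz"])
assert decode_key(encode_key(["foo", "bar|+|baz"])) == ["foo", "bar|+|baz"]
```

Since the field boundaries come from the length prefixes rather than from the content, values may contain any byte sequence without making the key ambiguous.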

This may also require that we track key schemas per topic. It may not be reasonable to expect source topics to be keyed according to this internal format. (We currently have that expectation, but feel that's OK because it's reasonable to expect a string key.) This means we would need to support both some external format and our internal format.

@miguno
Contributor

miguno commented Apr 16, 2018

Related ticket: #824

@rmoff
Contributor

rmoff commented Feb 8, 2019

@rodesai is this superseded by #824? Or does it remain a separate issue to address?

@apurvam
Contributor

apurvam commented Oct 16, 2019

This is superseded by #824. Also see #3533. Closing this out.

@apurvam apurvam closed this as completed Oct 16, 2019