Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: add GROUP BY support for any key names #4899

Merged

Conversation

big-andy-coates
Copy link
Contributor

Description

fixes: #4898

This commit sees the result of a GROUP BY on a single column reference have a schema with a key column matching the name of the column, e.g.

-- source schema: A -> B, C
CREATE STREAM OUTPUT AS SELECT COUNT(1) AS COUNT FROM INPUT GROUP BY B;
-- output schema: B -> COUNT

If the GROUP BY is on anything other than a single column reference then the key column will be a unique generated column name, e.g.

-- source schema: A -> B, C
CREATE STREAM OUTPUT AS SELECT COUNT(1) FROM INPUT GROUP BY B+1;
-- output schema: KSQL_COL_1 -> KSQL_COL_0  (Both names are generated)

BREAKING CHANGE: Existing queries that reference a single GROUP BY column in the projection would fail if they were resubmitted, due to a duplicate column. The same existing queries will continue to run if already running, i.e. this is only a change for newly submitted queries. Existing queries will use the old query semantics.

This PR also includes an change to the internals of how GROUP BY is handled: the name of the columns from the source in the repartition and changelog topics had names like KSQL_INTERNAL_COL_0 etc. With this source columns retain their original names within the internal topics. Names such as KSQL_INTERNAL_COL_0 are used only for additional columns used to track UDAF parameters. This change is fully backwards compatible, as the plan stores the column names used in the internal topics. However, the change has meant a whole load of new historical plans needed to be generated.

Testing done

usual

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@big-andy-coates big-andy-coates requested a review from a team as a code owner March 26, 2020 00:46
@agavra
Copy link
Contributor

agavra commented Mar 26, 2020

@big-andy-coates thanks for this PR! Do you mind splitting it into two commits to separate the historical plans from the code?

big-andy-coates and others added 2 commits March 26, 2020 13:10
fixes: confluentinc#4898

This commit sees the result of a GROUP BY on a single column reference have a schema with a key column matching the name of the column, e.g.

```sql
-- source schema: A -> B, C
CREATE STREAM OUTPUT AS SELECT COUNT(1) AS COUNT FROM INPUT GROUP BY B;
-- output schema: B -> COUNT
```

If the GROUP BY is on anything other than a single column reference then the key column will be a unique generated column name, e.g.

```sql
-- source schema: A -> B, C
CREATE STREAM OUTPUT AS SELECT COUNT(1) FROM INPUT GROUP BY B+1;
-- output schema: KSQL_COL_1 -> KSQL_COL_0  (Both names are generated)
```

BREAKING CHANGE: Existing queries that reference a single GROUP BY column in the projection would fail if they were resubmitted, due to a duplicate column. The same existing queries will continue to run if already running, i.e. this is only a change for newly submitted queries. Existing queries will use the old query semantics.
Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @big-andy-coates - FYI I split your commit into two so that I could review it better and force-pushed to your branch

@big-andy-coates big-andy-coates merged commit e7cbdfc into confluentinc:master Mar 27, 2020
@big-andy-coates big-andy-coates deleted the group_by_semantics branch March 27, 2020 10:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix GROUP BY semantics for keys with any names
2 participants