-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: pick up key name from field name on GROUP BY and PARTITON BY #4902
chore: pick up key name from field name on GROUP BY and PARTITON BY #4902
Conversation
fixes: confluentinc#4898 This commit sees the result of a GROUP BY on a single column reference have a schema with a key column matching the name of the column, e.g. ```sql -- source schema: A -> B, C CREATE STREAM OUTPUT AS SELECT COUNT(1) AS COUNT FROM INPUT GROUP BY B; -- output schema: B -> COUNT ``` If the GROUP BY is on anything other than a single column reference then the key column will be a unique generated column name, e.g. ```sql -- source schema: A -> B, C CREATE STREAM OUTPUT AS SELECT COUNT(1) FROM INPUT GROUP BY B+1; -- output schema: KSQL_COL_1 -> KSQL_COL_0 (Both names are generated) ``` BREAKING CHANGE: Existing queries that reference a single GROUP BY column in the projection would fail if they were resubmitted, due to a duplicate column. The same existing queries will continue to run if already running, i.e. this is only a change for newly submitted queries. Existing queries will use the old query semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the right behavior to be implementing.
For one, if I were to do a SELECT ADDRESS->TOWN FROM ADDRESSES
I would get a schema ADDRESSES__TOWN
, not TOWN
- so at a minimum we should be consistent with that behavior.
Next, though, is what do we do if we have something like:
CREATE STREAM A (k VARCHAR KEY, town VARCHAR, address STRUCT<town VARCHAR>);
SELECT TOWN FROM A PARTITION BY address->town;
This would now fail because TOWN
would exist twice in the resulting schema! I would then be forced to rename the town
in the value schema because I don't have a choice in the naming of the key (yet).
There is always a chance of a name clash when we're deriving column names from the schema. Switching from CREATE STREAM A (k VARCHAR KEY, address__town VARCHAR, address STRUCT<town VARCHAR>);
SELECT * FROM A PARTITION BY address->town; Hence, I don't think requiring the alias is unreasonable. However, you make a good point about consistency. I think this should fail with a duplicate column name: SELECT A->B, C FROM X PARTITION BY A->B; And for that to fail we'd need consistent naming! So I'll look to switch. |
ksqldb-engine/src/main/java/io/confluent/ksql/planner/LogicalPlanner.java ksqldb-functional-tests/src/test/resources/query-validation-tests/group-by.json ksqldb-streams/src/main/java/io/confluent/ksql/execution/streams/GroupByParamsFactory.java ksqldb-streams/src/test/java/io/confluent/ksql/execution/streams/GroupByParamsFactoryTest.java
Humm... while looking into this I found this issue: #4911, which we probably want to resolve as part of this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Description
Stacked on top of #4899. Only review the second commit.
fixes: #4895
This commit changes GROUP BY, PARTITION BY and JOINs to pick up the key column name from a struct field, e.g.
Will result in the key column being called
B
.Testing done
usual
Reviewer checklist