chore: pick up key name from field name on GROUP BY and PARTITON BY #4902

big-andy-coates · 2020-03-26T16:53:32Z

Description

Stacked on top of #4899. Only review the second commit.

This commit changes GROUP BY, PARTITION BY and JOINs to pick up the key column name from a struct field, e.g.

CREATE STREAM O AS SELECT * FROM I PARTITION BY A->B;

Will result in the key column being called B.

Testing done

usual

Reviewer checklist

Ensure docs are updated if necessary (e.g. if a user-visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")

fixes: confluentinc#4898

This commit sees the result of a GROUP BY on a single column reference have a schema with a key column matching the name of the column, e.g.

```sql
-- source schema: A -> B, C
CREATE STREAM OUTPUT AS SELECT COUNT(1) AS COUNT FROM INPUT GROUP BY B;
-- output schema: B -> COUNT
```

If the GROUP BY is on anything other than a single column reference then the key column will be a unique generated column name, e.g.

```sql
-- source schema: A -> B, C
CREATE STREAM OUTPUT AS SELECT COUNT(1) FROM INPUT GROUP BY B+1;
-- output schema: KSQL_COL_1 -> KSQL_COL_0 (Both names are generated)
```

BREAKING CHANGE: Existing queries that reference a single GROUP BY column in the projection would fail if they were resubmitted, due to a duplicate column. The same existing queries will continue to run if already running, i.e. this is only a change for newly submitted queries. Existing queries will use the old query semantics.
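The naming rule in this commit message can be illustrated with a toy model. This is a minimal Python sketch, not ksqlDB's implementation: `key_column_name` and its parameters are hypothetical, and the numbering of generated names (unaliased projection columns first, key column last) is an assumption inferred only from the second example above.

```python
import re

# Hypothetical model of the rule in the commit message; not ksqlDB code.
# A "single column reference" is approximated as one bare identifier.
_COLUMN_REF = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def key_column_name(group_by_expr, projection_aliases):
    """Return the key column name for a single GROUP BY expression.

    projection_aliases has one entry per projected column; None marks a
    column without an alias, which is assumed to take a generated
    KSQL_COL_<n> name.
    """
    expr = group_by_expr.strip()
    if _COLUMN_REF.fullmatch(expr):
        # Single column reference: the key takes the column's name.
        # Unquoted identifiers are upper-cased.
        return expr.upper()
    # Anything else: assume the key takes the next unused generated index,
    # after the unaliased projection columns.
    generated_in_projection = sum(1 for alias in projection_aliases if alias is None)
    return f"KSQL_COL_{generated_in_projection}"


print(key_column_name("B", ["COUNT"]))   # B
print(key_column_name("B+1", [None]))    # KSQL_COL_1
```

The two calls mirror the two statements in the commit message: `GROUP BY B` with an aliased `COUNT`, and `GROUP BY B+1` with an unaliased `COUNT(1)`.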

fixes: confluentinc#4895

agavra

I don't think this is the right behavior to be implementing.

For one, if I were to do a SELECT ADDRESS->TOWN FROM ADDRESSES I would get a schema with ADDRESS__TOWN, not TOWN, so at a minimum we should be consistent with that behavior.

Next, though, is what do we do if we have something like:

CREATE STREAM A (k VARCHAR KEY, town VARCHAR, address STRUCT<town VARCHAR>);
SELECT TOWN FROM A PARTITION BY address->town;

This would now fail because TOWN would exist twice in the resulting schema! I would then be forced to rename the town in the value schema because I don't have a choice in the naming of the key (yet).

big-andy-coates · 2020-03-27T10:12:56Z

> For one, if I were to do a SELECT ADDRESS->TOWN FROM ADDRESSES I would get a schema with ADDRESS__TOWN, not TOWN, so at a minimum we should be consistent with that behavior.
>
> Next, though, is what do we do if we have something like:
>
> CREATE STREAM A (k VARCHAR KEY, town VARCHAR, address STRUCT<town VARCHAR>);
> SELECT TOWN FROM A PARTITION BY address->town;
>
> This would now fail because TOWN would exist twice in the resulting schema! I would then be forced to rename the town in the value schema because I don't have a choice in the naming of the key (yet).

There is always a chance of a name clash when we're deriving column names from the schema. Switching from TOWN to ADDRESS__TOWN does not fix that:

CREATE STREAM A (k VARCHAR KEY, address__town VARCHAR, address STRUCT<town VARCHAR>);
SELECT * FROM A PARTITION BY address->town;

Hence, I don't think requiring the alias is unreasonable.

However, you make a good point about consistency. I think this should fail with a duplicate column name:

SELECT A->B, C FROM X PARTITION BY A->B;

And for that to fail we'd need consistent naming! So I'll look to switch.
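To make the clash concrete, here is a rough sketch that assumes the key column takes the same name a projection of the same expression would get (so `A->B` becomes `A__B`). `projection_name` and `duplicate_columns` are hypothetical helpers, not ksqlDB APIs, and the schema check is reduced to a plain name comparison over the key and the value columns passed in.

```python
def projection_name(expr):
    # Assumed naming: a struct dereference A->B is named A__B, matching the
    # ADDRESS__TOWN form discussed above. Unquoted names are upper-cased.
    return expr.strip().upper().replace("->", "__")


def duplicate_columns(partition_by_expr, value_exprs):
    """Return the sorted column names that appear more than once in the
    output schema (key column plus value columns)."""
    names = [projection_name(partition_by_expr)]
    names += [projection_name(e) for e in value_exprs]
    return sorted({n for n in names if names.count(n) > 1})


# SELECT A->B, C FROM X PARTITION BY A->B;
print(duplicate_columns("A->B", ["A->B", "C"]))  # ['A__B']

# Value columns address__town and address, PARTITION BY address->town;
print(duplicate_columns("address->town", ["address__town", "address"]))  # ['ADDRESS__TOWN']

# No clash once the projection does not repeat the key expression.
print(duplicate_columns("A->B", ["C"]))  # []
```

Under this assumption the first query is rejected as intended, and the second shows that prefixing with the struct name still leaves room for a collision with an existing value column.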

ksqldb-engine/src/main/java/io/confluent/ksql/planner/LogicalPlanner.java
ksqldb-functional-tests/src/test/resources/query-validation-tests/group-by.json
ksqldb-streams/src/main/java/io/confluent/ksql/execution/streams/GroupByParamsFactory.java
ksqldb-streams/src/test/java/io/confluent/ksql/execution/streams/GroupByParamsFactoryTest.java

big-andy-coates · 2020-03-27T11:54:15Z

Humm... while looking into this I found this issue: #4911, which we probably want to resolve as part of this.

no longer applicable

agavra

LGTM!

big-andy-coates added 2 commits March 26, 2020 00:41

chore: pick up key name from field name on GROUP BY and PARTITON BY

ca2b355

fixes: confluentinc#4895

big-andy-coates requested a review from a team as a code owner March 26, 2020 16:53

agavra previously requested changes Mar 26, 2020

big-andy-coates added 2 commits March 27, 2020 10:35

chore: unintentional change reverted

2035e87

big-andy-coates added 2 commits March 27, 2020 12:03

chore: align naming of generated key column with projection name

5789910

chore: update tests

1529453

big-andy-coates requested a review from agavra March 27, 2020 12:06

big-andy-coates assigned agavra Mar 27, 2020

agavra approved these changes Mar 27, 2020

chore: fix test

03c5e41

big-andy-coates merged commit e2ee7e8 into confluentinc:master Mar 28, 2020

big-andy-coates deleted the any_key_by_struct_field branch March 28, 2020 17:19
