Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: pick up key name from field name on GROUP BY and PARTITON BY #4902

Merged

Conversation

big-andy-coates
Copy link
Contributor

@big-andy-coates big-andy-coates commented Mar 26, 2020

Description

Stacked on top of #4899. Only review the second commit.

fixes: #4895

This commit changes GROUP BY, PARTITION BY and JOINs to pick up the key column name from a struct field, e.g.

CREATE STREAM O AS SELECT * FROM I PARTITION BY A->B;

Will result in the key column being called B.

Testing done

usual

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

fixes: confluentinc#4898

This commit sees the result of a GROUP BY on a single column reference have a schema with a key column matching the name of the column, e.g.

```sql
-- source schema: A -> B, C
CREATE STREAM OUTPUT AS SELECT COUNT(1) AS COUNT FROM INPUT GROUP BY B;
-- output schema: B -> COUNT
```

If the GROUP BY is on anything other than a single column reference then the key column will be a unique generated column name, e.g.

```sql
-- source schema: A -> B, C
CREATE STREAM OUTPUT AS SELECT COUNT(1) FROM INPUT GROUP BY B+1;
-- output schema: KSQL_COL_1 -> KSQL_COL_0  (Both names are generated)
```

BREAKING CHANGE: Existing queries that reference a single GROUP BY column in the projection would fail if they were resubmitted, due to a duplicate column. The same existing queries will continue to run if already running, i.e. this is only a change for newly submitted queries. Existing queries will use the old query semantics.
@big-andy-coates big-andy-coates requested a review from a team as a code owner March 26, 2020 16:53
agavra
agavra previously requested changes Mar 26, 2020
Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right behavior to be implementing.

For one, if I were to do a SELECT ADDRESS->TOWN FROM ADDRESSES I would get a schema ADDRESSES__TOWN, not TOWN - so at a minimum we should be consistent with that behavior.

Next, though, is what do we do if we have something like:

CREATE STREAM A (k VARCHAR KEY, town VARCHAR, address STRUCT<town VARCHAR>);
SELECT TOWN FROM A PARTITION BY address->town;

This would now fail because TOWN would exist twice in the resulting schema! I would then be forced to rename the town in the value schema because I don't have a choice in the naming of the key (yet).

@big-andy-coates
Copy link
Contributor Author

big-andy-coates commented Mar 27, 2020

For one, if I were to do a SELECT ADDRESS->TOWN FROM ADDRESSES I would get a schema ADDRESSES__TOWN, not TOWN - so at a minimum we should be consistent with that behavior.
Next, though, is what do we do if we have something like:

CREATE STREAM A (k VARCHAR KEY, town VARCHAR, address STRUCT<town VARCHAR>);
SELECT TOWN FROM A PARTITION BY address->town;

This would now fail because TOWN would exist twice in the resulting schema! I would then be forced to rename the town in the value schema because I don't have a choice in the naming of the key (yet).

There is always a chance of a name clash when we're deriving column names from the schema. Switching from TOWN to ADDRESS__TOWN does not fix that:

CREATE STREAM A (k VARCHAR KEY, address__town VARCHAR, address STRUCT<town VARCHAR>);
SELECT * FROM A PARTITION BY address->town;

Hence, I don't think requiring the alias is unreasonable.

However, you make a good point about consistency. I think this should fail with a duplicate column name:

SELECT A->B, C FROM X PARTITION BY A->B;

And for that to fail we'd need consistent naming! So I'll look to switch.

ksqldb-engine/src/main/java/io/confluent/ksql/planner/LogicalPlanner.java
ksqldb-functional-tests/src/test/resources/query-validation-tests/group-by.json
ksqldb-streams/src/main/java/io/confluent/ksql/execution/streams/GroupByParamsFactory.java
ksqldb-streams/src/test/java/io/confluent/ksql/execution/streams/GroupByParamsFactoryTest.java
@big-andy-coates
Copy link
Contributor Author

Humm... while looking into this I found this issue: #4911, which we probably want to resolve as part of this.

Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@big-andy-coates big-andy-coates merged commit e2ee7e8 into confluentinc:master Mar 28, 2020
@big-andy-coates big-andy-coates deleted the any_key_by_struct_field branch March 28, 2020 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GROUP BY and PARTITION BY on field of struct should take key name from field name
2 participants