fix: fix repartition semantics #4816

big-andy-coates · 2020-03-18T18:44:29Z

Description

Fixes: #4749. (Note: the feature is currently disabled behind the 'allow any key column name' feature flag).

Background

This change fixes an issue with our repartition semantics.

Old style query semantics for partition by are broken:

S1: ROWKEY => B, C (Meaning S1 has a schema with ROWKEY as the key column, and B and C as value columns - types aren't important).

CREATE STREAM S2 AS SELECT * FROM S1 PARTITION BY B;

S2: ROWKEY => B, C

As you can see, the schema of S2 is unchanged. However, the old data in the key has been lost, as it has been overwritten with the data from B, and the key now duplicates the data stored in B.

This loss of data on a SELECT * .. PARTITION BY needed fixing.
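To make the data loss concrete, here is a minimal Java sketch of the old behaviour applied to a single row. The `Row` class and `oldPartitionBy` method are hypothetical names invented for this illustration and are not part of the ksqlDB code base; the sketch only assumes what is described above, namely that the key is overwritten with the value of `B` and the value columns are left untouched.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model for illustration only; not ksqlDB code.
final class OldPartitionBySketch {

  /** A row with a single key and named value columns. */
  static final class Row {
    final Object key;
    final Map<String, Object> value;

    Row(Object key, Map<String, Object> value) {
      this.key = key;
      this.value = new LinkedHashMap<>(value);
    }

    @Override
    public String toString() {
      return key + " => " + value;
    }
  }

  /** Old semantics: overwrite the key with the column's value, keep the value as-is. */
  static Row oldPartitionBy(Row in, String column) {
    return new Row(in.value.get(column), in.value);
  }

  public static void main(String[] args) {
    Map<String, Object> value = new LinkedHashMap<>();
    value.put("B", "b1");
    value.put("C", "c1");
    Row s1 = new Row("k1", value);

    // The original key "k1" appears nowhere in the result.
    System.out.println(oldPartitionBy(s1, "B"));
  }
}
```

After the call, the original key `k1` is no longer present in either the key or the value of the output row.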

Secondly, with the new primitive key work to remove the restriction on key column naming (#3536), the same query semantics do not work. For example:

S1: A => B, C

CREATE STREAM S2 AS SELECT * FROM S1 PARTITION BY B;

S2: B => B, C

This fails as the B value column clashes with the B key column!

The fix

This commit fixes the PARTITION BY semantics so that any PARTITION BY on a specific column moves the old key column into the value and moves the new key column out of the value and into the key. For example,

S1: A => B, C

CREATE STREAM S2 AS SELECT * FROM S1 PARTITION BY B;

Results in the schema: S2: B => C, A.
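The schema change for a column reference can be sketched in a few lines of Java. This is a simplified illustration under stated assumptions, not the actual `PartitionByParamsFactory` implementation: `Schema` and `partitionByColumn` are hypothetical names, the schema is modelled as one key column name plus an ordered list of value column names, and the sketch only handles a PARTITION BY on an existing value column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the real PartitionByParamsFactory.
final class RepartitionSketch {

  /** Minimal schema model: one key column name plus ordered value column names. */
  static final class Schema {
    final String key;
    final List<String> values;

    Schema(String key, List<String> values) {
      this.key = key;
      this.values = List.copyOf(values);
    }

    @Override
    public String toString() {
      return key + " => " + String.join(", ", values);
    }
  }

  /** New semantics: the chosen column becomes the key; the old key is appended to the value. */
  static Schema partitionByColumn(Schema source, String column) {
    if (!source.values.contains(column)) {
      // Simplification of this sketch: only value columns are supported.
      throw new IllegalArgumentException("Not a value column: " + column);
    }
    List<String> values = new ArrayList<>(source.values);
    values.remove(column);   // new key column leaves the value
    values.add(source.key);  // old key column joins the value
    return new Schema(column, values);
  }

  public static void main(String[] args) {
    Schema s1 = new Schema("A", List.of("B", "C"));
    System.out.println(partitionByColumn(s1, "B")); // prints "B => C, A"
  }
}
```

No column name appears twice in the result, so the key/value clash described in the background section cannot occur.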

If the PARTITION BY is an expression other than a column reference, then ksql will synthesize a new column name. For example,

S1: A => B, C

CREATE STREAM S2 AS SELECT * FROM S1 PARTITION BY CAST(B AS INT);

Results in the schema: S2: KSQL_COL_0 => B, C, A.
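The expression case can be sketched the same way. In the Java below, all names are hypothetical. Picking the first unused `KSQL_COL_n` index is an assumption of this sketch rather than something stated in this PR; the only confirmed behaviour is that the example above produces `KSQL_COL_0`, keeps all value columns, and appends the old key column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not ksqlDB code.
final class ExpressionPartitionSketch {

  /** Assumption of this sketch: use the first KSQL_COL_n name not already taken. */
  static String synthesizeKeyName(List<String> existingColumns) {
    int index = 0;
    while (existingColumns.contains("KSQL_COL_" + index)) {
      index++;
    }
    return "KSQL_COL_" + index;
  }

  /**
   * Schema after PARTITION BY on a non-column expression: every value column
   * is kept, the old key is appended, and the key gets a synthesized name.
   */
  static String partitionByExpression(String keyColumn, List<String> valueColumns) {
    List<String> values = new ArrayList<>(valueColumns);
    values.add(keyColumn);
    return synthesizeKeyName(values) + " => " + String.join(", ", values);
  }

  public static void main(String[] args) {
    // prints "KSQL_COL_0 => B, C, A"
    System.out.println(partitionByExpression("A", List.of("B", "C")));
  }
}
```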

GitHub issue #4813 will add the ability to use aliases in PARTITION BY expressions, allowing a custom name to be assigned.

The approach

There are two main changes:

The LogicalPlanner has been updated to build the new repartitioned schema via a new PartitionByParamsFactory class.
The streams topology is built differently by introducing a new version of the SelectKey plan step. The data passed is the same as the old version. However, the new version is handled by a new builder, which knows to build the streams topology in the right way.

It would also have been possible to achieve the second step by adding a defaulted flag to the existing SelectKey step. However, it was felt that clear separation was better. This means that once we reach version 1.0 we can simply delete the old V1 step, rather than trying to unpick the code that handled a boolean flag.
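The design choice can be illustrated with a small Java sketch. The class and method names here (`SelectKeyV1`, `SelectKeyV2`, `buildV1`, `buildV2`) are hypothetical and chosen only for this example; the real step and builder classes in ksqlDB may be named and wired differently. The point shown is that both step versions carry the same data, while each is handled by its own builder, so removing the old version later means deleting one class and one dispatch branch.

```java
// Hypothetical illustration of versioned plan steps; not ksqlDB code.
final class SelectKeyVersionsSketch {

  /** Both versions carry the same data: the expression to partition by. */
  interface Step {
    String keyExpression();
  }

  static final class SelectKeyV1 implements Step {
    private final String keyExpression;

    SelectKeyV1(String keyExpression) {
      this.keyExpression = keyExpression;
    }

    @Override
    public String keyExpression() {
      return keyExpression;
    }
  }

  static final class SelectKeyV2 implements Step {
    private final String keyExpression;

    SelectKeyV2(String keyExpression) {
      this.keyExpression = keyExpression;
    }

    @Override
    public String keyExpression() {
      return keyExpression;
    }
  }

  /** Old builder: key is overwritten, value is left unchanged. */
  static String buildV1(SelectKeyV1 step) {
    return "legacy: overwrite key with " + step.keyExpression();
  }

  /** New builder: old key moves into the value, new key moves out of it. */
  static String buildV2(SelectKeyV2 step) {
    return "new: swap key with " + step.keyExpression();
  }

  /** Dispatch on the step type instead of on a boolean flag inside one step. */
  static String build(Step step) {
    if (step instanceof SelectKeyV1) {
      return buildV1((SelectKeyV1) step);
    }
    if (step instanceof SelectKeyV2) {
      return buildV2((SelectKeyV2) step);
    }
    throw new IllegalArgumentException("Unknown step type: " + step.getClass());
  }

  public static void main(String[] args) {
    System.out.println(build(new SelectKeyV1("B")));
    System.out.println(build(new SelectKeyV2("B")));
  }
}
```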

Testing done

Usual.

Reviewer checklist

Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")

ksqldb-streams/src/main/java/io/confluent/ksql/execution/streams/PartitionByParamsFactory.java

...db-streams/src/test/java/io/confluent/ksql/execution/streams/StreamSelectKeyBuilderTest.java

ksqldb-common/src/main/java/io/confluent/ksql/name/ColumnNames.java

ksqldb-functional-tests/src/test/resources/query-validation-tests/partition-by.json

ksqldb-rest-app/src/test/resources/ksql-plan-schema/schema.json

agavra

Thanks for the explanations! LGTM

big-andy-coates requested a review from a team as a code owner March 18, 2020 18:44

This was referenced Mar 19, 2020

fix: Stop PARTITION BY and UDTFs that fail from terminating the query #4822

Merged

fix: do not filter out rows where PARTITION BY resolves to null #4823

Merged

big-andy-coates requested a review from rodesai March 19, 2020 15:53

agavra reviewed Mar 19, 2020

chore: changes requested by Almog

8e35587

big-andy-coates requested a review from agavra March 20, 2020 11:12

agavra approved these changes Mar 20, 2020

big-andy-coates merged commit 609e9e2 into confluentinc:master Mar 20, 2020

big-andy-coates deleted the paratition_by_semantics branch March 20, 2020 15:52
