fix: do not filter out rows where PARTITION BY resolves to null #4823

big-andy-coates · 2020-03-19T14:19:46Z

Description

Note: stacked on top of #4816 (which is now merged).

This commit removes the filter that was excluding any rows where the PARTITION BY clause resolved to a null value, i.e. either because the result was null or because an error occurred evaluating the expression.

This change will only affect new queries started. Pre-existing queries will continue to run as before.
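
To make the change concrete, here is a minimal sketch. The stream and column names (S1, S2, A, B, C) and the sample rows are assumptions for illustration only; they are not taken from the QTT tests added in this PR.

```sql
-- Assumed source stream S1: key column A, value columns B and C.
-- Suppose S1 receives these two rows (hypothetical data):
--   A=1, B='x',  C=10
--   A=2, B=NULL, C=20

CREATE STREAM S2 AS
  SELECT * FROM S1
  PARTITION BY B;

-- Before this change: the second row was filtered out and never reached S2,
--   because the PARTITION BY expression resolved to null.
-- After this change (newly started queries only): both rows reach S2, and
--   the second row is expected to carry a null key.
```

The same should apply when the expression fails to evaluate for a given row, since the description treats an evaluation error as resolving to null.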

Testing done

QTT tests added.

Reviewer checklist

Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")

Fixes: confluentinc#4749

##### Background

This change fixes an issue with our repartition semantics.

Old style query semantics for partition by are broken:

S1: ROWKEY => B, C (Meaning S1 has a schema with ROWKEY as the key column, and B and C as value columns - types aren't important).

```sql
CREATE STREAM S2 AS SELECT * FROM S1 PARTITION BY B;
```

S2: ROWKEY => B, C

As you can see, the schema of S2 is still the same. However, the old data in the key has been lost as it's been overwritten with the data from B, and the key now duplicates the data stored in B.

This loss of data on a `SELECT * .. PARTITION BY` needed fixing.

Secondly, with the new primitive key [work to remove the restriction on key column naming](confluentinc#3536), the same query semantics do not work, e.g.:

S1: A => B, C

```sql
CREATE STREAM S2 AS SELECT * FROM S1 PARTITION BY B;
```

S2: B => B, C

This fails as the `B` value column clashes with the `B` key column!

##### The fix

This commit fixes the PARTITION BY semantics so that any PARTITION BY on a specific column sees the old key column being moved to the value and the new key column being moved from the value to the key. For example:

S1: A => B, C

```sql
CREATE STREAM S2 AS SELECT * FROM S1 PARTITION BY B;
```

Results in the schema: S2: B => C, A.

If a PARTITION BY is an expression other than a column reference, then ksql will synthesize a new column name. For example:

S1: A => B, C

```sql
CREATE STREAM S2 AS SELECT * FROM S1 PARTITION BY CAST(B AS INT);
```

Results in the schema: S2: KSQL_COL_0 => B, C, A.

[This GitHub issue](confluentinc#4813) will add the ability to use aliases in PARTITION BY expressions, allowing a custom name to be assigned.

Fixes: confluentinc#4747

This commit removes the filter that was excluding any rows where the `PARTITION BY` clause resolved to a `null` value, i.e. either because the result was `null` or because an error occurred evaluating the expression.

This change will only affect new queries started. Pre-existing queries will continue to run as before.

big-andy-coates · 2020-03-19T17:20:08Z

FYI, the docs call out the limitations of partitioning by null, and suggest using COALESCE (added in #4829) to avoid them.
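
As a rough sketch of that suggestion: wrapping the PARTITION BY expression in COALESCE substitutes a default whenever the expression is null, so no row ends up with a null key. The stream names, the column B (assumed here to be a STRING) and the fallback value 'unknown' are all assumptions for illustration, not text from the docs.

```sql
-- Hypothetical example: B is assumed to be a STRING value column of S1.
-- COALESCE returns its first non-null argument, so rows where B is null
-- are keyed by the literal 'unknown' rather than by null.
CREATE STREAM S2 AS
  SELECT * FROM S1
  PARTITION BY COALESCE(B, 'unknown');
```

Because this is an expression rather than a plain column reference, the key column of S2 would get a synthesized name such as KSQL_COL_0, per the semantics introduced in #4816. Note also that every row with a null B then shares one key, which may or may not be desirable.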

docs-md/developer-guide/joins/partition-data.md

JimGalasyn

LGTM, with a couple of suggestions.

Co-Authored-By: Jim Galasyn <[email protected]>

Conflicting files
ksqldb-functional-tests/src/test/resources/query-validation-tests/partition-by.json
ksqldb-streams/src/main/java/io/confluent/ksql/execution/streams/PartitionByParams.java
ksqldb-streams/src/main/java/io/confluent/ksql/execution/streams/PartitionByParamsFactory.java
ksqldb-streams/src/main/java/io/confluent/ksql/execution/streams/StreamSelectKeyBuilder.java
ksqldb-streams/src/test/java/io/confluent/ksql/execution/streams/PartitionByParamsFactoryTest.java
ksqldb-streams/src/test/java/io/confluent/ksql/execution/streams/StreamSelectKeyBuilderTest.java

big-andy-coates · 2020-03-20T17:45:20Z

ksqldb-functional-tests/src/test/resources/query-validation-tests/joins.json

-        {"topic": "_confluent-ksql-some.ksql.service.idquery_CSAS_LEFT_OUTER_JOIN_0-KSTREAM-OUTEROTHER-0000000019-store-changelog", "window": {"start": 16000, "end": 27000, "type": "time"}, "key": 100, "value": {"TT_ROWTIME": 16000, "TT_K": "", "TT_ID": 100, "TT_F1": "newblah", "TT_F2": 150}, "timestamp": 16000},
-        {"topic": "_confluent-ksql-some.ksql.service.idquery_CSAS_LEFT_OUTER_JOIN_0-KSTREAM-JOINTHIS-0000000018-store-changelog", "window": {"start": 17000, "end": 28000, "type": "time"}, "key": 90, "value": {"T_ROWTIME": 17000, "T_K": "", "T_ID": 90, "T_NAME": "ninety", "T_VALUE": 90}, "timestamp": 17000},
-        {"topic": "_confluent-ksql-some.ksql.service.idquery_CSAS_LEFT_OUTER_JOIN_0-KSTREAM-JOINTHIS-0000000018-store-changelog", "window": {"start": 30000, "end": 41000, "type": "time"}, "key": 0, "value": {"T_ROWTIME": 30000, "T_K": "", "T_ID": 0, "T_NAME": "bar", "T_VALUE": 99}, "timestamp": 30000},
+        {"topic": "_confluent-ksql-some.ksql.service.idquery_CSAS_LEFT_OUTER_JOIN_0-KSTREAM-JOINTHIS-0000000016-store-changelog", "window": {"start": 0, "end": 11000, "type": "time"}, "key": 0, "value": {"T_ROWTIME": 0, "T_K": "", "T_ID": 0, "T_NAME": "zero", "T_VALUE": 0}, "timestamp": 0},


Note: it's safe to change these as they are from an unreleased feature...

agavra

I'm not sure I understand why this is backwards compatible on a server restart - is the predicate somehow encoded in the query plan?

agavra

Andy answered my question offline: old queries will go through the StreamSelectKeyBuilderV1, which maintains the old behavior.

big-andy-coates added 2 commits March 18, 2020 18:42

big-andy-coates requested a review from a team as a code owner March 19, 2020 14:19

docs: call out limitation of partiiton by NULL

2b1750b

big-andy-coates requested a review from JimGalasyn as a code owner March 19, 2020 17:19

JimGalasyn reviewed Mar 19, 2020

docs-md/developer-guide/joins/partition-data.md (outdated, resolved)

JimGalasyn reviewed Mar 19, 2020

docs-md/developer-guide/joins/partition-data.md (outdated, resolved)

JimGalasyn approved these changes Mar 19, 2020

agavra mentioned this pull request Mar 19, 2020

fix: fix repartition semantics #4816

Merged

2 tasks

big-andy-coates and others added 5 commits March 20, 2020 10:44

Update docs-md/developer-guide/joins/partition-data.md

8eafb3e

Co-Authored-By: Jim Galasyn <[email protected]>

Update docs-md/developer-guide/joins/partition-data.md

f077bf4

Co-Authored-By: Jim Galasyn <[email protected]>

test: fix tsts

243af69

Merge branch 'master' into partition_by_null

64a1118

big-andy-coates commented Mar 20, 2020

agavra reviewed Mar 23, 2020

agavra approved these changes Mar 23, 2020

big-andy-coates merged commit e75a792 into confluentinc:master Mar 23, 2020

big-andy-coates deleted the partition_by_null branch March 23, 2020 16:00
