fix: avoid duplicate column name errors from auto-generated aliases #4827

big-andy-coates · 2020-03-19T15:52:48Z

Description

If a source has column names in the form KSQL_COL_x, e.g. like those created is a select expression isn't a column reference and isn't aliased, then the existing code can result in an error about duplicate column names if a downstream query also relies on auto-generated column names. For example,

-- schema: ID -> NAME
CREATE STREAM S1 (ID INT KEY, NAME STRING) WITH (kafka_topic='input', value_format='JSON');

-- schema: ID -> KSQL_COL_0, NAME
CREATE STREAM S2 AS SELECT LEN(STRING), NAME FROM S1;

-- both of the following result in a duplicate column error as two columns end up being called KSQL_COL_0:
SELECT UCASE(NAME), * FROM S2;
SELECT UCASE(NAME), KSQL_COL_0 FROM S2;

The issue is that the generation of aliases does not take into account any existing columns in the source(s).

This commit resolves this issue. Queries now start their generated alias index as one-more-than the maximum of any generated alias in any of the sources.

With this change the above queries work:

-- schema: KSQL_COL_1, KSQL_COL_0, NAME
SELECT UCASE(NAME), * FROM S2;

-- schema: KSQL_COL_1, KSQL_COL_1
SELECT UCASE(NAME), KSQL_COL_0 FROM S2;

BREAKING CHANGE:

Potential breaking change for any query started before version 0.8:

Any existing persistent queries, e.g. those created with CREATE STREAM AS SELECT, CREATE TABLE AS SELECT or INSERT INTO
- starting on version 8.0 or later: will be unaffected and their column names will not change.
- started on a version before 0.8 and using auto-generated column names: may see there column names change, leading to undefined behaviour.
Pull queries will be unaffected.
Push queries, which rely on auto-generated column names, may see a change in column names.

Testing done

Usual

Reviewing notes:

First commit has the main code changes.
Second commit has the updates to the tests due to the change in column naming.

Reviewer checklist

Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")

If a source has column names in the form `KSQL_COL_x`, e.g. like those created is a select expression isn't a column reference and isn't aliased, then the existing code can result in an error about duplicate column names if a downstream query also relies on auto-generated column names. For example, ```sql -- schema: ID -> NAME CREATE STREAM S1 (ID INT KEY, NAME STRING) WITH (kafka_topic='input', value_format='JSON'); -- schema: ID -> KSQL_COL_0, NAME CREATE STREAM S2 AS SELECT LEN(STRING), NAME FROM S1; -- both of the following result in a duplicate column error as two columns end up being called KSQL_COL_0: SELECT UCASE(NAME), * FROM S2; SELECT UCASE(NAME), KSQL_COL_0 FROM S2; ``` The issue is that the generation of aliases does not take into account any existing columns in the source(s). This commit resolves this issue. Queries now start their generated alias index as one-more-than the maximum of any generated alias in any of the sources. With this change the above queries work: ```sql -- schema: KSQL_COL_1, KSQL_COL_0, NAME SELECT UCASE(NAME), * FROM S2; -- schema: KSQL_COL_1, KSQL_COL_1 SELECT UCASE(NAME), KSQL_COL_0 FROM S2; ``` BREAKING CHANGE: * Any existing persistent queries, e.g. those created with `CREATE STREAM AS SELECT`, `CREATE TABLE AS SELECT` or `INSERT INTO`, will be unaffected: their column names will not change. * Pull queries will be unaffected. * Push queries, which rely on auto-generated column names, may see a change in column names.

Tests that needed updating to pass with the new column names.

Conflicting files ksqldb-common/src/main/java/io/confluent/ksql/name/ColumnNames.java ksqldb-common/src/test/java/io/confluent/ksql/name/ColumnNamesTest.java

ksqldb-common/src/main/java/io/confluent/ksql/name/ColumnNames.java

vpapavas · 2020-03-23T21:51:34Z

You use a Set to keep the aliases that have been previously used. I am wondering why this is necessary and why using the max does not work.
The aliases are increasing monotonically. If we have a single source, it should suffice to just take the current max and increase it by one to get the next unused alias, right? If we have multiple sources where one is at index 3 and the other at index 6 for example, then we take the max of both and add one which should return 7. This should guarantee that the alias is unused, no?
Looking at the old code it seems that it is covering the case of a single source schema. So, I am wondering why the examples you give fail.

rodesai · 2020-03-24T06:07:45Z

ksqldb-common/src/main/java/io/confluent/ksql/name/ColumnNames.java

+        .boxed()
+        .collect(Collectors.toSet());
+
+    return new AliasGenerator(0, used)::next;


nit: if we pass 1 here we should get names more consistent with what we had before (start at offset 1), and a smaller test diff.

I thought that too and tried it. However, the old code started at zero too. If we switch this to 1 we just get a different set of tests to change.

private int selectItemIndex = 0;

The changes are because the name is no longer controlled by the index of the select expresssion.

rodesai

LGTM, one optional nit inline

big-andy-coates · 2020-03-24T09:33:54Z

You use a Set to keep the aliases that have been previously used. I am wondering why this is necessary and why using the max does not work.
The aliases are increasing monotonically. If we have a single source, it should suffice to just take the current max and increase it by one to get the next unused alias, right? If we have multiple sources where one is at index 3 and the other at index 6 for example, then we take the max of both and add one which should return 7. This should guarantee that the alias is unused, no?
Looking at the old code it seems that it is covering the case of a single source schema. So, I am wondering why the examples you give fail.

We don't control the column names in the source. It's therefore possible that even a single source could have a column names in the form KSQL_COL_x where x of one is close to 0 and another is close to Integer.MAX_VALUE. If we just take the max and increment, then we need to handle overflow of the counter and risk clashing with other columns. For example,

CREATE STREAM FOO (KSQL_COL_0 INT KEY, KSQL_COL_2147483647 NAME) WITH (...);
SELECT KSQL_COL_0, ABS(KSQL_COL_0) FROM FOO;

The above will fail with your proposed strategy as both columns will be called KSQL_COL_0.

While we don't control the name of columns, it's unlikely that there will be millions of columns! Hence, tracking the columns we know we can't use is the easiest solution.

big-andy-coates added 2 commits March 19, 2020 15:49

chore: test updates

fadc4d4

Tests that needed updating to pass with the new column names.

big-andy-coates requested a review from a team as a code owner March 19, 2020 15:52

chore: historic plans

8624110

big-andy-coates mentioned this pull request Mar 20, 2020

fix: fix repartition semantics #4816

Merged

2 tasks

big-andy-coates added 4 commits March 20, 2020 18:09

chore: merge from master

d118103

Conflicting files ksqldb-common/src/main/java/io/confluent/ksql/name/ColumnNames.java ksqldb-common/src/test/java/io/confluent/ksql/name/ColumnNamesTest.java

Merge branch 'master' into ksql_col_clashes

de871ea

chore: merge from master

53de5fe

Merge branch 'master' into ksql_col_clashes

b72da69

vpapavas reviewed Mar 23, 2020

View reviewed changes

ksqldb-common/src/main/java/io/confluent/ksql/name/ColumnNames.java Show resolved Hide resolved

vpapavas self-assigned this Mar 23, 2020

rodesai reviewed Mar 24, 2020

View reviewed changes

rodesai approved these changes Mar 24, 2020

View reviewed changes

big-andy-coates requested a review from vpapavas March 24, 2020 09:34

big-andy-coates added 3 commits March 24, 2020 10:35

chore: fix pull query tests

4d987a7

Merge branch 'master' into ksql_col_clashes

9aa4e58

chore: historical plans

1812153

big-andy-coates merged commit 258d0b0 into confluentinc:master Mar 24, 2020

big-andy-coates deleted the ksql_col_clashes branch March 24, 2020 15:00

fjbecerra mentioned this pull request Oct 29, 2020

CTAS: Erroneous "Duplicate values found in schema" #3976

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: avoid duplicate column name errors from auto-generated aliases #4827

fix: avoid duplicate column name errors from auto-generated aliases #4827

big-andy-coates commented Mar 19, 2020 •

edited

Loading

vpapavas commented Mar 23, 2020

rodesai Mar 24, 2020

big-andy-coates Mar 24, 2020 •

edited

Loading

rodesai left a comment

big-andy-coates commented Mar 24, 2020

fix: avoid duplicate column name errors from auto-generated aliases #4827

fix: avoid duplicate column name errors from auto-generated aliases #4827

Conversation

big-andy-coates commented Mar 19, 2020 • edited Loading

Description

Testing done

Reviewing notes:

Reviewer checklist

vpapavas commented Mar 23, 2020

rodesai Mar 24, 2020

Choose a reason for hiding this comment

big-andy-coates Mar 24, 2020 • edited Loading

Choose a reason for hiding this comment

rodesai left a comment

Choose a reason for hiding this comment

big-andy-coates commented Mar 24, 2020

big-andy-coates commented Mar 19, 2020 •

edited

Loading

big-andy-coates Mar 24, 2020 •

edited

Loading