Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: avoid duplicate column name errors from auto-generated aliases #4827

Merged
merged 10 commits into from
Mar 24, 2020

Conversation

big-andy-coates
Copy link
Contributor

@big-andy-coates big-andy-coates commented Mar 19, 2020

Description

If a source has column names in the form KSQL_COL_x, e.g. like those created is a select expression isn't a column reference and isn't aliased, then the existing code can result in an error about duplicate column names if a downstream query also relies on auto-generated column names. For example,

-- schema: ID -> NAME
CREATE STREAM S1 (ID INT KEY, NAME STRING) WITH (kafka_topic='input', value_format='JSON');

-- schema: ID -> KSQL_COL_0, NAME
CREATE STREAM S2 AS SELECT LEN(STRING), NAME FROM S1;

-- both of the following result in a duplicate column error as two columns end up being called KSQL_COL_0:
SELECT UCASE(NAME), * FROM S2;
SELECT UCASE(NAME), KSQL_COL_0 FROM S2;

The issue is that the generation of aliases does not take into account any existing columns in the source(s).

This commit resolves this issue. Queries now start their generated alias index as one-more-than the maximum of any generated alias in any of the sources.

With this change the above queries work:

-- schema: KSQL_COL_1, KSQL_COL_0, NAME
SELECT UCASE(NAME), * FROM S2;

-- schema: KSQL_COL_1, KSQL_COL_1
SELECT UCASE(NAME), KSQL_COL_0 FROM S2;

BREAKING CHANGE:

Potential breaking change for any query started before version 0.8:

  • Any existing persistent queries, e.g. those created with CREATE STREAM AS SELECT, CREATE TABLE AS SELECT or INSERT INTO
    • starting on version 8.0 or later: will be unaffected and their column names will not change.
    • started on a version before 0.8 and using auto-generated column names: may see there column names change, leading to undefined behaviour.
  • Pull queries will be unaffected.
  • Push queries, which rely on auto-generated column names, may see a change in column names.

Testing done

Usual

Reviewing notes:

  • First commit has the main code changes.
  • Second commit has the updates to the tests due to the change in column naming.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

If a source has column names in the form `KSQL_COL_x`, e.g. like those created is a select expression isn't a column reference and isn't aliased, then the existing code can result in an error about duplicate column names if a downstream query also relies on auto-generated column names. For example,

```sql
-- schema: ID -> NAME
CREATE STREAM S1 (ID INT KEY, NAME STRING) WITH (kafka_topic='input', value_format='JSON');

-- schema: ID -> KSQL_COL_0, NAME
CREATE STREAM S2 AS SELECT LEN(STRING), NAME FROM S1;

-- both of the following result in a duplicate column error as two columns end up being called KSQL_COL_0:
SELECT UCASE(NAME), * FROM S2;
SELECT UCASE(NAME), KSQL_COL_0 FROM S2;
```

The issue is that the generation of aliases does not take into account any existing columns in the source(s).

This commit resolves this issue. Queries now start their generated alias index as one-more-than the maximum of any generated alias in any of the sources.

With this change the above queries work:

```sql
-- schema: KSQL_COL_1, KSQL_COL_0, NAME
SELECT UCASE(NAME), * FROM S2;

-- schema: KSQL_COL_1, KSQL_COL_1
SELECT UCASE(NAME), KSQL_COL_0 FROM S2;
```

BREAKING CHANGE:
* Any existing persistent queries, e.g. those created with `CREATE STREAM AS SELECT`, `CREATE TABLE AS SELECT` or `INSERT INTO`, will be unaffected: their column names will not change.
* Pull queries will be unaffected.
* Push queries, which rely on auto-generated column names, may see a change in column names.
Tests that needed updating to pass with the new column names.
@big-andy-coates big-andy-coates requested a review from a team as a code owner March 19, 2020 15:52
Conflicting files
ksqldb-common/src/main/java/io/confluent/ksql/name/ColumnNames.java
ksqldb-common/src/test/java/io/confluent/ksql/name/ColumnNamesTest.java
@vpapavas
Copy link
Member

You use a Set to keep the aliases that have been previously used. I am wondering why this is necessary and why using the max does not work.
The aliases are increasing monotonically. If we have a single source, it should suffice to just take the current max and increase it by one to get the next unused alias, right? If we have multiple sources where one is at index 3 and the other at index 6 for example, then we take the max of both and add one which should return 7. This should guarantee that the alias is unused, no?
Looking at the old code it seems that it is covering the case of a single source schema. So, I am wondering why the examples you give fail.

@vpapavas vpapavas self-assigned this Mar 23, 2020
.boxed()
.collect(Collectors.toSet());

return new AliasGenerator(0, used)::next;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if we pass 1 here we should get names more consistent with what we had before (start at offset 1), and a smaller test diff.

Copy link
Contributor Author

@big-andy-coates big-andy-coates Mar 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that too and tried it. However, the old code started at zero too. If we switch this to 1 we just get a different set of tests to change.

 private int selectItemIndex = 0;

The changes are because the name is no longer controlled by the index of the select expresssion.

Copy link
Contributor

@rodesai rodesai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one optional nit inline

@big-andy-coates
Copy link
Contributor Author

You use a Set to keep the aliases that have been previously used. I am wondering why this is necessary and why using the max does not work.
The aliases are increasing monotonically. If we have a single source, it should suffice to just take the current max and increase it by one to get the next unused alias, right? If we have multiple sources where one is at index 3 and the other at index 6 for example, then we take the max of both and add one which should return 7. This should guarantee that the alias is unused, no?
Looking at the old code it seems that it is covering the case of a single source schema. So, I am wondering why the examples you give fail.

We don't control the column names in the source. It's therefore possible that even a single source could have a column names in the form KSQL_COL_x where x of one is close to 0 and another is close to Integer.MAX_VALUE. If we just take the max and increment, then we need to handle overflow of the counter and risk clashing with other columns. For example,

CREATE STREAM FOO (KSQL_COL_0 INT KEY, KSQL_COL_2147483647 NAME) WITH (...);
SELECT KSQL_COL_0, ABS(KSQL_COL_0) FROM FOO;

The above will fail with your proposed strategy as both columns will be called KSQL_COL_0.

While we don't control the name of columns, it's unlikely that there will be millions of columns! Hence, tracking the columns we know we can't use is the easiest solution.

@big-andy-coates big-andy-coates merged commit 258d0b0 into confluentinc:master Mar 24, 2020
@big-andy-coates big-andy-coates deleted the ksql_col_clashes branch March 24, 2020 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants