Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add klip-24: key column semantics in queries. #5115

Merged
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
150 changes: 134 additions & 16 deletions design-proposals/klip-24-key-column-semantics-in-queries.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,8 +6,8 @@
**Discussion**: TBD

**tl;dr:** Persistent queries do not allow the key column in the projection as key columns are
currently _implicitly_ copied across. This KLIP proposes flipping this so that the key column is
_not_ implicitly copied across, and is therefore _required_ in persistent query projections.
currently _implicitly_ copied across. This is not intuitive to anyone familiar with SQL. This KLIP
proposes flipping this so that the key column is _not_ implicitly copied across.

## Motivation and background

Expand Down Expand Up @@ -47,9 +47,7 @@ columns named `ID` and hence the query fails.
We propose that the above `CREATE TABLE` statement should work, as the projection is equivalent to
`select *`, which works.

### Select star

The differences with a 'select *' projection are more subtle.
### Non-join select star

For non-join queries, both transient and persistent queries select all the columns in the schema:

Expand All @@ -63,8 +61,13 @@ CREATE TABLE OUTPUT AS SELECT * FROM INPUT;
-- resulting schema: ID INT KEY, V0 INT, V1 INT (Same as above)
```

### Joins

The differences with joins is more subtle.

With a join query, a transient query selects all columns from all sources. A persistent query adds
all columns from all sources, but also adds an additional copy of the left key column.
all columns from all sources, and also adds an additional column that stores the result of the join
criteria.

```sql
SELECT * FROM INPUT I1 JOIN INPUT_2 I2 ON I1.ID = I2.ID;
Expand All @@ -73,17 +76,79 @@ SELECT * FROM INPUT I1 JOIN INPUT_2 I2 ON I1.ID = I2.ID;
-- vs --

CREATE TABLE OUTPUT AS SELECT * FROM INPUT I1 JOIN INPUT_2 I2 ON I1.ID = I2.ID;
-- resulting schema: ID INT KEY, I1_ID INT, I1_V0 INT, I1_V1 INT, I2_ID INT, I2_V0 INT, I2_V1 INT
-- resulting schema (any key name enabled): ID INT KEY, I1_ID INT, I1_V0 INT, I1_V1 INT, I2_ID INT, I2_V0 INT, I2_V1 INT
big-andy-coates marked this conversation as resolved.
Show resolved Hide resolved
-- resulting schema (any key name disabled): ROWKEY INT KEY, I1_ID INT, I1_V0 INT, I1_V1 INT, I2_ID INT, I2_V0 INT, I2_V1 INT
```

Note the addition of an additional `ID` column in the case of the persistent query.
Note the addition of an additional `ID` or `ROWKEY` column in the case of the persistent query.

#### Inner and left outer joins on column references

For inner and left outer joins where the left join criteria is a column reference the the
big-andy-coates marked this conversation as resolved.
Show resolved Hide resolved
additional `ID` column is a duplicate of the `I1.ID` column.

We propose that such joins should not duplicate the left join column `I1.ID` into both the `ID` key
and `I1_ID` value column. Instead, the key column should be named `I1_ID` and no copy should be stored
in the value.

#### Full outer joins and non-column joins

Where the join is a full outer join, or where the left join criteria is not a column reference, the
problem is more nuanced.

```sql
-- inner join on expression:
CREATE TABLE OUTPUT AS SELECT * FROM INPUT I1 JOIN INPUT_2 I2 ON ABS(I1.ID) = I2.ID;
big-andy-coates marked this conversation as resolved.
Show resolved Hide resolved

-- full outer join:
CREATE TABLE OUTPUT AS SELECT * FROM INPUT I1 OUTER JOIN INPUT_2 I2 ON I1.ID = I2.ID;
```

We propose that the join should not duplicate the left join column `I1.ID` into both the `ID` key
and `I1_ID` value column.
For both of the above joins the data stored in the Kafka record's key by ksqlDB / Streams does
not correspond to any column within either source. This is problematic.

Only the key column should exist and it should be called either (TBD):
- `ID`, or
- `I1_ID`.
This KLIP proposes that the projection should include _all the columns expected in the result_.
Logically, this must include this new key column. Yet, this new key column is an artifact of the
current join implementation, so what should it be called and how should the user include it in the
projection?

The synthesised column is currently named `ROWKEY`. However, the 'any key name' feature removes the
`ROWKEY` system column in favour of user supplied names or system generated ones in the form
`KSQL_COL_x`.

If we are to require users to explicitly include this synthesised column in any projection with
explicit columns, i.e. non select-star projections, then the user must be able to determine the name
and be able to provide their own. This precludes the user of a system generated name such as
`KSQL_COL_0`.

Though not ideal, we propose that in the short term the synthesised column will be named `ROWKEY`,
as this is a name users are already familiar with. For example, the above examples that used `*` in
their projections could be expanded to the following explicit column lists:
big-andy-coates marked this conversation as resolved.
Show resolved Hide resolved

```sql
-- inner join on expression:
CREATE TABLE OUTPUT AS SELECT ROWKEY, I1.ID, I1.V0, I1.V1, I2.ID, I2.V0, I2.V1 INT FROM INPUT I1 JOIN INPUT_2 I2 ON ABS(I1.ID) = I2.ID;
-- resulting schema: ROWKEY INT KEY, I1_ID INT, I1_V0 INT, I1_V1 INT, I2_ID INT, I2_V0 INT, I2_V1 INT

-- full outer join:
CREATE TABLE OUTPUT AS SELECT ROWKEY, I1.ID, I1.V0, I1.V1, I2.ID, I2.V0, I2.V1 FROM INPUT I1 OUTER JOIN INPUT_2 I2 ON I1.ID = I2.ID;
-- resulting schema: ROWKEY INT KEY, I1_ID INT, I1_V0 INT, I1_V1 INT, I2_ID INT, I2_V0 INT, I2_V1 INT
```

Requiring the user to include a synthesised column in the projection is not idea. However, we
propose this is best short term solution given the current functionality.

Key to this solution is the ability for users to provide their own name for the synthesised key
column. For example:

```sql
CREATE TABLE OUTPUT AS SELECT ROWKEY AS ID, I1.V0, I1.V1, I2.V0, I2.V1 INT FROM INPUT I1 OUTER JOIN INPUT_2 I2 ON I1.ID = I2.ID;
-- resulting schema: ID INT KEY, I1_V0 INT, I1_V1 INT, I2_V0 INT, I2_V1 INT
```

A more correct implementation might store both sides of the join criteria in the key for all join
types. However, such an approach would require support for multiple key columns and extensive
changes in Streams. See [rejected alternatives](#rejected_alternatives) for more info.

### Removal of non-standard Aliasing

Expand Down Expand Up @@ -134,7 +199,7 @@ CREATE STREAM INPUT (ID INT KEY, V0 INT, V1 INT) WITH (...);
CREATE TABLE OUTPUT AS SELECT V0, COUNT(*) AS COUNT FROM INPUT GROUP BY V0;
-- fails with duplicate column error on 'V0'.

-- Note: the equivalent query would of failed before the any-key-name work:
-- Note: the equivalent query would have failed before the any-key-name work:
CREATE TABLE OUTPUT AS SELECT ROWKEY, COUNT(*) AS COUNT FROM INPUT GROUP BY V0;
-- fails with duplicate column error on 'ROWKEY'.
```
Expand Down Expand Up @@ -289,11 +354,11 @@ example:

```sql
-- 'any key' aliasing syntax that will be dropped:
CREATE TABLE OUTPUT AS SELECT COUNT() AS COUNT FROM INPUT GROUP BY (V0 + V1) AS K;
CREATE TABLE OUTPUT AS SELECT COUNT() AS COUNT FROM INPUT GROUP BY V0 AS K;
-- resulting schema: K INT KEY, COUNT BIGINT

-- proposed key column aliasing in projection:
CREATE TABLE OUTPUT AS SELECT (V0 + v1) AS K, COUNT() AS COUNT FROM INPUT GROUP BY V0 + V1;
CREATE TABLE OUTPUT AS SELECT V0 AS K, COUNT() AS COUNT FROM INPUT GROUP BY V0;
-- resulting schema: K INT KEY, COUNT BIGINT
```

Expand Down Expand Up @@ -385,3 +450,56 @@ in downstream queries.
This was rejected for this KLIP as it would involve considerably more work. This may be picked up in
a future KLIP.

### Storing all join columns in the key of the result

A more correct solution for handling columns within a join may look to store all join columns in the
Kafka record's key, for example:

```sql
SELECT * FROM INPUT I1 JOIN INPUT_2 I2 ON I1.ID = I2.ID;
-- resulting columns: I1_ID INT, I1_V0 INT, I1_V1 INT, I2_ID INT, I2_V0 INT, I2_V1 INT

CREATE TABLE OUTPUT AS SELECT * FROM INPUT I1 JOIN INPUT_2 I2 ON I1.ID = I2.ID;
-- resulting schema: I1_ID INT KEY, I2_ID INT KEY, I1_V0 INT, I1_V1 INT, I2_V0 INT, I2_V1 INT
```

Note that both `I1_ID` and `I2_ID` are marked as key columns. Such an approach may be required if
ksqlDB is to support join criterion other than the current equality.

However, this is rejected as a solution for now for the following reasons:
a. Such a solution requires ksqlDB to support multiple key columns. It currently does not, and this
KLIP is part of the work moving towards such support. Hence its a chicken and egg problem.
b. Such a solution requires Streams to be able to correctly handle the multiple key columns
correctly, which it currently does not. This is particularly challenging for outer joins,
big-andy-coates marked this conversation as resolved.
Show resolved Hide resolved
where some key columns may initially be `null` and later populated. Any solution needs to ensure
correct partitioning and update semantics for such rows.

### System generated naming for the synthesised join column

Where a join introduces a synthesised key column the column would have a system generated name.
Two naming strategies were rejected:

#### `KSQL_COL_x`

The synthesised column would take on a generated name in the form `KSQL_COL_x`, where `x` is an
integer and the name is guaranteed to be unique, i.e. to not clash with any other system generated
column names.

This was rejected as the name would be hard for the user to know by looking at the query, i.e. the
presence of other columns with generated names may affect the name of the synthesised column. This
was deemed to make this solution unworkable.

#### Data driven naming

The synthesised column would take on a name generated from the join criteria. For example, a join
such as `A JOIN B ON A.ID = B.ID` would result in a key column named `A_ID__B_ID`.

While this is deterministic and offers improved protection against column name clashes than a static
naming strategy, it was rejected as:
a. it adds additional complexity to the code
b. it adds additional cognitive load for users, i.e. the need to know the naming strategy and work
out the name from the criteria.
c. a change in the join criteria requires a change in the projection.
d. name clashes are still possible.
big-andy-coates marked this conversation as resolved.
Show resolved Hide resolved