Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Push queries: Inconsistency with pull queries, return all rows per window #3871

Closed
vpapavas opened this issue Nov 15, 2019 · 3 comments · Fixed by #4401
Closed

Push queries: Inconsistency with pull queries, return all rows per window #3871

vpapavas opened this issue Nov 15, 2019 · 3 comments · Fixed by #4401

Comments

@vpapavas
Copy link
Member

vpapavas commented Nov 15, 2019

Is your feature request related to a problem? Please describe.
There is an inconsistency between pull and push queries and how windowed rows are handled.

Consider the table order_quantities_windowed created using the CTAS query

CREATE TABLE order_quantities_windowed AS
SELECT
  gen_orderid,
  sum(orderunits) as total_qty,
FROM orders_stream_time
WINDOW TUMBLING (size 1 seconds)
GROUP BY gen_orderid;

Doing this pull query: select * from ORDER_QUANTITIES_WINDOWED where rowkey='order-1'; returns the result

+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
|ROWKEY                                                                                  |WINDOWSTART                                                                             |GEN_ORDERID                                                                             |TOTAL_QTY                                                                               |
+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
|order-1                                                                                 |0                                                                                       |order-1                                                                                 |141.98525929105705                                                                      |
|order-1                                                                                 |1000                                                                                    |order-1                                                                                 |80.4438182250072                                                                        |
|order-1                                                                                 |2000                                                                                    |order-1                                                                                 |62.3613039783794                                                                        |
|order-1                                                                                 |3000                                                                                    |order-1                                                                                 |60.41352844086285                                                                       |
|order-1                                                                                 |4000                                                                                    |order-1                                                                                 |71.69513992454593                                                                       
Query terminated

Now, if we issue the following push query:

select * from order_quantities_windowed  where rowkey='order-1' emit changes;

the result is empty the reason being the WHERE clause evaluates always to false since the ROWKEY has the window information as well.

Describe the solution you'd like
Pull queries ignore the window information in the condition in the WHERE clause but push queries don't ignore it. This is confusing for the user as they don't even know that the ROWKEY has additional data from just looking at the result of the pull query.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@big-andy-coates
Copy link
Contributor

We should pick this up as part of the work to support structured keys.

@vinothchandar
Copy link
Contributor

Is this targeted for 0.7.0 release?

@apurvam apurvam added this to the 0.7.0 milestone Dec 21, 2019
@apurvam
Copy link
Contributor

apurvam commented Dec 21, 2019

I added it to the 0.7.0 milestone, but it is nice to have IMO and hence not labeled p0. So far I don’t think anybody has complained about this.

@big-andy-coates big-andy-coates self-assigned this Jan 28, 2020
big-andy-coates added a commit to big-andy-coates/ksql that referenced this issue Jan 29, 2020
fixes: confluentinc#3871

Is needed to fix:
- confluentinc#3633
- confluentinc#4015

Before this change the version of `ROWKEY` copied to the value schema during processing of data in the Streams topology was always of type `STRING` regardless of the actual key type. This is because windowed keys had a `ROWKEY` in the format `<actual key> : Window{start=<windowStart>, end=<windowEnd>}`. While `ROWKEY` in the value schema was a `STRING`, `ROWKEY` in the key schema was the actual type, e.g. `INT`.  This is confusing and will lead to bugs.  Also, the formated string isn't very friendly for users.

This change looks to introduce the `WINDOWSTART` and `WINDOWEND` columns that were reserved in confluentinc#4388. The obvious approach would be to add `WINDOWSTART` and `WINDOWEND` as columns in the key schema. Unfortunately, this would be a much bigger change as many parts of the code currently rely on there being only a single key column. The planned structured key work will resolve this.

For now, we only add the windows bounds columns when we `LogicalSchema.withMetaAndKeyColsInValue(true)`. This is a bit of a temporary hack, but gets us where we need to be. This will be cleaned up as part of the structured key work.

With this change `ROWKEY` for windowed sources no longer has the format `<actual key> : Window{start=<windowStart>, end=<windowEnd>}`: `ROWKEY` is now only the _actual_ key and the window bounds can be accessed by `WINDOWSTART` and `WINDOWEND`. These two window bounds columns are included in a pull `SELECT *` query. Likewise a join will include the window bounds columns from both sides in the join result if the join is `SELECT *`.

## Examples:

### Push queries

* A select * on a windowed source will not include `WINDOWSTART` and `WINDOWEND`. `ROWKEY` will be the actual key, not a formatted string.

```
ksql> SELECT * FROM windowedSource emit changes

-- old output
+---------------+------------------------------------------------------+--------+---------+------+
| ROWTIME       | ROWKEY                                               | USERID | PAGEID  | TOTAL|
+---------------+------------------------------------------------------+--------+---------+------+
| 1557183929488 | User_9|+|Page_39 : Window{start=1557183900000 end=-} | User_9 | Page_39 | 1    |
| 1557183930211 | User_1|+|Page_79 : Window{start=1557183900000 end=-} | User_1 | Page_79 | 1    |

-- new output
+---------------+---------------+---------------+------------------+--------+---------+------+
| ROWTIME       | WINDOWSTART   | WINDOWEND     | ROWKEY           | USERID | PAGEID  | TOTAL|
+---------------+---------------+---------------+------------------+--------+---------+------+
| 1557183919786 | 1557183900000 | 1557183960000 | User_5|+|Page_12 | User_5 | Page_12 | 1    |
| 1557183929488 | 1557183900000 | 1557183960000 | User_9|+|Page_39 | User_9 | Page_39 | 1    |
```

* `WINDOWSTART` and `WINDOWEND` are available in the SELECT, GROUPBY, WHERE, HAVING clauses etc.

For example:

```sql
SELECT TIMESTAMPTOSTRING(WINDOWSTART,'yyyy-MM-dd HH:mm:ss Z') FROM windowedSource emit changes;
```

However, don't get too excited just yet as there is a known limitation that drastically reduces the availability of this syntax:

**KNOWN LIMITATION**
Where a query builds a windowed source from a non-windowed source the window bounds columns are not available.  For example:

```
-- won't yet work:
SELECT WINDOWSTART FROM FROM someSource WINDOW TUMBLING (SIZE 1 SECOND) group by ROWKEY;
```

This issue is tracked by: confluentinc#4397

* Joins of windowed sources include the `WINDOWSTART` and `WINDOWEND` columns from both sides.

### Pull queries

**KNOWN LIMITATION**
Pull queries have not been updated yet. This will be done in a follow up PR confluentinc#3633. This is mainly to keep this PR manageable.

### Persistent queries

Persistent C*AS queries work similar to push queries and have the same known limitation.

BREAKING CHANGE: Any query of a windowed source that uses `ROWKEY` in the SELECT projection will see the contents of `ROWKEY` change from a formatted `STRING` containing the underlying key and the window bounds, to just the underlying key.  Queries can access the window bounds using `WINDOWSTART` and `WINDOWEND`.

BREAKING CHANGE: Joins on windowed sources now include `WINDOWSTART` and `WINDOWEND` columns from both sides on a `SELECT *`.
big-andy-coates added a commit that referenced this issue Jan 29, 2020
#4401)

* chore: support window bounds columns in persistent and pull queries

fixes: #3871

Is needed to fix:
- #3633
- #4015

Before this change the version of `ROWKEY` copied to the value schema during processing of data in the Streams topology was always of type `STRING` regardless of the actual key type. This is because windowed keys had a `ROWKEY` in the format `<actual key> : Window{start=<windowStart>, end=<windowEnd>}`. While `ROWKEY` in the value schema was a `STRING`, `ROWKEY` in the key schema was the actual type, e.g. `INT`.  This is confusing and will lead to bugs.  Also, the formated string isn't very friendly for users.

This change looks to introduce the `WINDOWSTART` and `WINDOWEND` columns that were reserved in #4388. The obvious approach would be to add `WINDOWSTART` and `WINDOWEND` as columns in the key schema. Unfortunately, this would be a much bigger change as many parts of the code currently rely on there being only a single key column. The planned structured key work will resolve this.

For now, we only add the windows bounds columns when we `LogicalSchema.withMetaAndKeyColsInValue(true)`. This is a bit of a temporary hack, but gets us where we need to be. This will be cleaned up as part of the structured key work.

With this change `ROWKEY` for windowed sources no longer has the format `<actual key> : Window{start=<windowStart>, end=<windowEnd>}`: `ROWKEY` is now only the _actual_ key and the window bounds can be accessed by `WINDOWSTART` and `WINDOWEND`. These two window bounds columns are included in a pull `SELECT *` query. Likewise a join will include the window bounds columns from both sides in the join result if the join is `SELECT *`.

## Examples:

### Push queries

* A select * on a windowed source will not include `WINDOWSTART` and `WINDOWEND`. `ROWKEY` will be the actual key, not a formatted string.

```
ksql> SELECT * FROM windowedSource emit changes

-- old output
+---------------+------------------------------------------------------+--------+---------+------+
| ROWTIME       | ROWKEY                                               | USERID | PAGEID  | TOTAL|
+---------------+------------------------------------------------------+--------+---------+------+
| 1557183929488 | User_9|+|Page_39 : Window{start=1557183900000 end=-} | User_9 | Page_39 | 1    |
| 1557183930211 | User_1|+|Page_79 : Window{start=1557183900000 end=-} | User_1 | Page_79 | 1    |

-- new output
+---------------+---------------+---------------+------------------+--------+---------+------+
| ROWTIME       | WINDOWSTART   | WINDOWEND     | ROWKEY           | USERID | PAGEID  | TOTAL|
+---------------+---------------+---------------+------------------+--------+---------+------+
| 1557183919786 | 1557183900000 | 1557183960000 | User_5|+|Page_12 | User_5 | Page_12 | 1    |
| 1557183929488 | 1557183900000 | 1557183960000 | User_9|+|Page_39 | User_9 | Page_39 | 1    |
```

* `WINDOWSTART` and `WINDOWEND` are available in the SELECT, GROUPBY, WHERE, HAVING clauses etc.

For example:

```sql
SELECT TIMESTAMPTOSTRING(WINDOWSTART,'yyyy-MM-dd HH:mm:ss Z') FROM windowedSource emit changes;
```

However, don't get too excited just yet as there is a known limitation that drastically reduces the availability of this syntax:

**KNOWN LIMITATION**
Where a query builds a windowed source from a non-windowed source the window bounds columns are not available.  For example:

```
-- won't yet work:
SELECT WINDOWSTART FROM FROM someSource WINDOW TUMBLING (SIZE 1 SECOND) group by ROWKEY;
```

This issue is tracked by: #4397

* Joins of windowed sources include the `WINDOWSTART` and `WINDOWEND` columns from both sides.

### Pull queries

**KNOWN LIMITATION**
Pull queries have not been updated yet. This will be done in a follow up PR #3633. This is mainly to keep this PR manageable.

### Persistent queries

Persistent C*AS queries work similar to push queries and have the same known limitation.

BREAKING CHANGE: Any query of a windowed source that uses `ROWKEY` in the SELECT projection will see the contents of `ROWKEY` change from a formatted `STRING` containing the underlying key and the window bounds, to just the underlying key.  Queries can access the window bounds using `WINDOWSTART` and `WINDOWEND`.

BREAKING CHANGE: Joins on windowed sources now include `WINDOWSTART` and `WINDOWEND` columns from both sides on a `SELECT *`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants