Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CTAS windows missing data when time already advanced to new window and inserting late-arriving records. #4579

Closed
k1th opened this issue Feb 17, 2020 · 4 comments
Labels

Comments

@k1th
Copy link

k1th commented Feb 17, 2020

Describe the bug
CTAS with older data does not put data in the defined windows. Sorry, for renaming the issues a million times.

To Reproduce
Single node kafka/zk/… with 5.4.0, cp-ksql-server
Retention setting on broker 700 days

Make the retention for changelog topics really big and set auto offset reset:

ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' to 'earliest'. Use the UNSET command to revert your change.
ksql> set 'windowstore.changelog.additional.retention.ms' = '30000000000';
Successfully changed local property 'windowstore.changelog.additional.retention.ms' to '30000000000'. Use the UNSET command to revert your change.

Use the tutorial from https://kafka-tutorials.confluent.io/create-tumbling-windows/ksql.html#write-the-program-interactively-using-the-cli updated for 5.4.0

and fill the ratings topic with some data. I ended up with this (only relevant columns shown – just added more random data in the past):

ksql> select title, timestamp from ratings emit changes;
+---------------------------------------------------------------------------+---------------------------------------------------------------------------+
|TITLE                                                                      |TIMESTAMP                                                                  |
+---------------------------------------------------------------------------+---------------------------------------------------------------------------+
|Die Hard                                                                   |2019-07-09 01:00:00                                                        |
|Die Hard                                                                   |2019-07-09 05:00:00                                                        |
|Die Hard                                                                   |2019-07-09 07:00:00                                                        |
|Tree of Life                                                               |2019-07-09 09:00:00                                                        |
|Tree of Life                                                               |2019-07-09 08:00:00                                                        |
|A Walk in the Clouds                                                       |2019-07-09 12:00:00                                                        |
|A Walk in the Clouds                                                       |2019-07-09 15:00:00                                                        |
|A Walk in the Clouds                                                       |2019-07-09 22:00:00                                                        |
|The Big Lebowski                                                           |2019-07-09 05:00:00                                                        |
|The Big Lebowski                                                           |2019-07-09 02:00:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 18:00:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 22:00:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 00:01:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 12:00:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 22:01:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 22:02:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 22:00:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 00:01:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 12:00:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 22:02:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 22:02:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 22:02:00                                                        |
|Die Hard                                                                   |2019-07-09 01:00:00                                                        |
|Die Hard                                                                   |2019-07-09 05:00:00                                                        |
|Die Hard                                                                   |2019-07-09 07:00:00                                                        |
|Tree of Life                                                               |2019-07-09 09:00:00                                                        |
|Tree of Life                                                               |2019-07-09 08:00:00                                                        |
|A Walk in the Clouds                                                       |2019-07-09 12:00:00                                                        |
|A Walk in the Clouds                                                       |2019-07-09 15:00:00                                                        |
|A Walk in the Clouds                                                       |2019-07-09 22:00:00                                                        |
|The Big Lebowski                                                           |2019-07-09 05:00:00                                                        |
|The Big Lebowski                                                           |2019-07-09 02:00:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 18:00:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 22:00:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 22:01:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 00:01:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 00:01:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 22:00:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 21:00:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 21:00:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 21:00:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 21:01:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 21:01:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 21:01:00                                                        |
|Super Mario Bros.                                                          |2020-02-16 21:01:00                                                        |
|Super Mario Bros.                                                          |2020-01-16 21:01:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:01:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:01:00                                                        |
|Super Mario Bros.                                                          |2020-02-16 21:02:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:02:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:02:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:02:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:02:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:03:00                                                        |
|Super Mario Bros.                                                          |2020-02-16 21:02:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:03:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:04:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:05:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:05:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:05:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:05:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:05:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:05:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 21:00:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 21:00:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:06:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:06:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 21:00:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 21:01:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 21:01:00                                                        |
|Super Mario Bros.                                                          |2020-02-15 21:06:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 21:02:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 21:02:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 21:02:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 21:02:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 21:02:00                                                        |
|Die Hard                                                                   |2019-07-09 01:00:00                                                        |
|Die Hard                                                                   |2019-07-09 05:00:00                                                        |
|Die Hard                                                                   |2019-07-09 07:00:00                                                        |
|Tree of Life                                                               |2019-07-09 09:00:00                                                        |
|Tree of Life                                                               |2019-07-09 08:00:00                                                        |
|A Walk in the Clouds                                                       |2019-07-09 12:00:00                                                        |
|A Walk in the Clouds                                                       |2019-07-09 15:00:00                                                        |
|A Walk in the Clouds                                                       |2019-07-09 22:00:00                                                        |
|The Big Lebowski                                                           |2019-07-09 05:00:00                                                        |
|The Big Lebowski                                                           |2019-07-09 02:00:00                                                        |
|Super Mario Bros.                                                          |2019-07-09 18:00:00                                                        |
|Die Hard                                                                   |2019-07-10 01:00:00                                                        |
|Die Hard                                                                   |2020-02-01 01:00:00                                                        |
|Die Hard                                                                   |2020-02-01 01:00:00                                                        |
|Die Hard                                                                   |2020-02-01 04:00:00                                                        |
|Die Hard                                                                   |2020-02-01 22:10:00                                                        |
|Die Hard                                                                   |2020-02-01 22:10:00                                                        |
|Die Hard                                                                   |2020-02-01 22:10:00                                                        |
|Die Hard                                                                   |2020-02-01 22:11:00                                                        |
|Die Hard                                                                   |2020-02-01 22:11:00                                                        |
|Die Hard                                                                   |2020-02-01 22:01:00                                                        |
|Die Hard                                                                   |2020-02-01 22:01:00                                                        |
|Die Hard                                                                   |2020-02-01 22:02:00                                                        |
|Die Hard                                                                   |2020-02-01 22:02:00                                                        |
|Die Hard                                                                   |2020-02-01 22:02:00                                                        |
|Die Hard                                                                   |2020-02-01 22:03:00                                                        |
|Die Hard                                                                   |2020-02-15 22:03:00                                                        |
|Die Hard                                                                   |2020-02-15 22:04:00                                                        |
|Die Hard                                                                   |2020-02-01 22:03:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 21:02:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 21:02:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 21:03:00                                                        |
|Chris                                                                      |2019-07-10 21:03:00                                                        |
|Chris                                                                      |2019-07-10 21:03:00                                                        |
|Chris                                                                      |2019-07-10 21:04:00                                                        |
|Chris                                                                      |2019-07-10 21:04:00                                                        |
|Chris                                                                      |2019-07-10 21:04:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 21:03:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 21:04:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 20:04:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 19:04:00                                                        |
|Super Mario Bros.                                                          |2019-07-11 19:04:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 19:04:00                                                        |
|Super Mario Bros.                                                          |2019-07-10 19:04:00                                                        |

Now create a windowed table

CREATE TABLE rating_count_24
>    WITH (kafka_topic='rating_count_24') AS
>    SELECT title,
>           COUNT(*) AS rating_count,
>           WINDOWSTART() AS window_start,
>           WINDOWEND() AS window_end
>    FROM ratings
>    WINDOW TUMBLING (SIZE 24 HOURS)
>    GROUP BY title;

Actual behaviour
and select everything:

ksql> SELECT title,
>       rating_count,
>       TIMESTAMPTOSTRING(window_start, 'yyyy-MM-dd HH:mm:ss'),
>       TIMESTAMPTOSTRING(window_end, 'yyyy-MM-dd HH:mm:ss')
>FROM rating_count_24
>emit changes;
+------------------------------------+------------------------------------+------------------------------------+------------------------------------+
|TITLE                               |RATING_COUNT                        |KSQL_COL_2                          |KSQL_COL_3                          |
+------------------------------------+------------------------------------+------------------------------------+------------------------------------+
|Die Hard                            |3                                   |2019-07-09 00:00:00                 |2019-07-10 00:00:00                 |
|Tree of Life                        |2                                   |2019-07-09 00:00:00                 |2019-07-10 00:00:00                 |
|A Walk in the Clouds                |3                                   |2019-07-09 00:00:00                 |2019-07-10 00:00:00                 |
|The Big Lebowski                    |2                                   |2019-07-09 00:00:00                 |2019-07-10 00:00:00                 |
|Super Mario Bros.                   |2                                   |2019-07-09 00:00:00                 |2019-07-10 00:00:00                 |
|Super Mario Bros.                   |7                                   |2019-07-10 00:00:00                 |2019-07-11 00:00:00                 |
|Super Mario Bros.                   |3                                   |2020-02-16 00:00:00                 |2020-02-17 00:00:00                 |

There is quite some data missing, e.g. everything with "Chris".

This is the same for 8 hour windows as well. The old records simply don't make it into the window, if the window is in the "yesterday" time frame.

Expected behavior
All old data should be there. Especially since the message timestamps in the ratings input topic are current and nothing is deleted on the broker.

Additional context
I managed to get debug logging on ksql-server. We often see
DEBUG Skipping record for expired segment. (org.apache.kafka.streams.state.internals.AbstractRocksDBSegmentedBytesStore)
but the logging is not very deep, so we don't know which records were skipped, in the log.

Changelog topic is automatically set to
cleanup.policy | compact,delete
retention.ms | 30086400000

output table sink topic has broker default (700 days)

@k1th k1th added the bug label Feb 17, 2020
@k1th k1th changed the title 24 hour windows missing data from input topic (RocksDB "old" segments?) CTAS windows missing data from input topic (RocksDB "old" segments?) Feb 17, 2020
@k1th
Copy link
Author

k1th commented Feb 17, 2020

#4363

@k1th
Copy link
Author

k1th commented Feb 18, 2020

Interesting as well is this behavior:
Start a select on the table:

ksql> SELECT title,
>       rating_count,
>       TIMESTAMPTOSTRING(window_start, 'yyyy-MM-dd HH:mm:ss'),
>       TIMESTAMPTOSTRING(window_end, 'yyyy-MM-dd HH:mm:ss')
>FROM rating_count_24

Insert some data for today:
INSERT INTO ratings (title, release_year, rating, timestamp) VALUES ('Chris_now', 1993, 3.5, '2020-02-18 00:00:01');

result shows up in running query
|Chris_now |5 |2020-02-18 00:00:00 |2020-02-19 00:00:00 |

insert some data for yesterday (= the previous window):
ksql> INSERT INTO ratings (title, release_year, rating, timestamp) VALUES ('Chris_now', 1993, 3.5, '2020-02-17 22:59:59');
nothing shows up in the running query, nothing is persisted into the changelog topic, but the inserted record is in the inbound ratings topic and the streams client backing the table has no lag.

I thought late-arriving records should be added to a window even if it's in the past. In my case here I am inserting at 17:00 into yesterday's window…

@k1th k1th changed the title CTAS windows missing data from input topic (RocksDB "old" segments?) CTAS windows missing data from input topic with 24 hour windows Feb 18, 2020
@k1th
Copy link
Author

k1th commented Feb 19, 2020

figured out where the missing data is by setting DEBUG on org.apache.kafka.streams.kstream.internals:

Skipping record for expired window. key=[Struct{ROWKEY=Chris_now}] topic=[ratings] partition=[0] offset=[219] timestamp=[1581940800000] window=[1581897600000,1581984000000) expiration=[1582027200000] streamTime=[1582027200000] (org.apache.kafka.streams.kstream.internals.KStreamWindowAggregate)
ksql-server_1            | [2020-02-19 17:07:14,804] DEBUG Skipping record for expired window. key=[Struct{ROWKEY=Chris_now}] topic=[ratings] partition=[0] offset=[219] timestamp=[1581940800000] window=[1581897600000,1581984000000) expiration=[1582027200000] streamTime=[1582027200000] (org.apache.kafka.streams.kstream.internals.KStreamWindowAggregate)
ksql-server_1            | [2020-02-19 17:07:14,804] DEBUG Skipping record for expired window. key=[Struct{ROWKEY=Chris_now}] topic=[ratings] partition=[0] offset=[219] timestamp=[1581940800000] window=[1581897600000,1581984000000) expiration=[1582027200000] streamTime=[1582027200000] (org.apache.kafka.streams.kstream.internals.KStreamWindowAggregate)

This is related to https://issues.apache.org/jira/browse/KAFKA-8769 and cannot be controlled via ksql, because we cannot set windows.gracePeriodMs

KStreamWindowAggregate sets the observed time for the current partition to:

            observedStreamTime = Math.max(observedStreamTime, timestamp);
            final long closeTime = observedStreamTime - windows.gracePeriodMs();

This means, inserting any older record for the same key does not end up in the window, if not
(windowEnd > closeTime)

This creates a bunch of problems for us, when entering a new window at 00:00 and records coming in for the past window in the same partition. This does not work:

ksql> INSERT INTO ratings (title, release_year, rating, timestamp) VALUES ('totall_new_key', 1993, 3.5, '2020-02-19 00:01:00');
ksql> INSERT INTO ratings (title, release_year, rating, timestamp) VALUES ('totall_new_key', 1993, 3.5, '2020-02-18 23:59:00');

The second record is discarded:

ksql-server_1            | [2020-02-19 17:13:06,547] DEBUG Skipping record for expired window. key=[Struct{ROWKEY=totall_new_key}] topic=[ratings] partition=[0] offset=[222] timestamp=[1582070340000] window=[1581984000000,1582070400000) expiration=[1582070460000] streamTime=[1582070460000] (org.apache.kafka.streams.kstream.internals.KStreamWindowAggregate)

@k1th k1th changed the title CTAS windows missing data from input topic with 24 hour windows CTAS windows missing data when time already advanced to new window and inserting late-arriving records. Feb 19, 2020
@k1th
Copy link
Author

k1th commented Mar 11, 2020

closing, because issue will probably be resolved with #4733

@k1th k1th closed this as completed Mar 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant