
Fix dynamic ppss start date causes issues #1095

Conversation

@KatunaNorbert (Member) commented May 28, 2024

Fixes #1094.

DoD:

  • Figure out st_ts and end_ts before calling the pipeline. Then, call the pipeline with a fixed st_ts and end_ts, such that these parameters DO NOT CHANGE during the course of the ETL pipeline.
  • Add checks/validations/asserts that (end_ts - st_ts) > 1 day. Anything below this is a horrible configuration: if the subgraph goes down for a day, such a configuration fails.

@idiom-bytes (Member) commented May 28, 2024

Hi @KatunaNorbert,

Rather than "fixing" the problem with a big system-level change and no test...

[Requirements]
I want you to please create a test that shows me why/how the problem is happening.
Please use with/patch and the fixture from test_etl.py to try and isolate the issue.

Example: Based on your update, I'm led to believe that this problem is happening both at the Raw + ETL Table level since you're changing both GQL & ETL code...

However, I can't even understand the problem...

Because lake_ss start timestamp value could be dynamic by now

I think what you mean is that when you have ppss configured like this...
lake_ss.st_ts = "1 day ago"
lake_ss.end_ts = "now"

You end up getting a "dynamic start timestamp". Thus, I need you to improve your ticket/example/test coverage/proposed solution and provide a better test/example showing me how this problem is solved.
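As a standalone illustration (toy code, not pdr-backend's actual timestr resolver), this is how a relative st_timestr drifts when it gets re-parsed at different points of the same run:

```python
import time
from datetime import datetime, timedelta, timezone

def parse_timestr(timestr: str) -> int:
    """Toy resolver: "1 day ago" -> epoch ms *at the moment of the call*."""
    if timestr.endswith("ago"):
        n_days = int(timestr.split()[0])
        dt = datetime.now(timezone.utc) - timedelta(days=n_days)
    else:
        dt = datetime.strptime(timestr, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

st_at_gql = parse_timestr("1 day ago")   # resolved when the GQL fetch starts
time.sleep(2)                            # ...pipeline does some work...
st_at_etl = parse_timestr("1 day ago")   # resolved again when the ETL step starts

# The two "start of window" values differ by the elapsed wall-clock time,
# so the raw and bronze tables end up covering slightly different windows.
assert st_at_etl > st_at_gql
```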

[Overview]
In my mental model, the functions we use in GQL and ETL to calculate st_ts and end_ts should work.
@ gql_data_factory.py - LoC 291, 296
@ etl.py - LoC 213

Whether you use:
st_ts: "05-01-2024"
st_ts: "28d ago"

Both should resolve to the same checkpoint/logic/tables processed. No rows should be dropped, and no duplicate/extra rows should be created.

"predictions" and "bronze_predictions" should maintain the same number of rows.

[Avoid tight lookback windows]
As long as we don't make the st_ts <-> end_ts window too small ("1h ago" <-> "now"), I believe the pipeline should keep running and maintain all tables up-to-date.

Preferably, we can put an `assert (end_ts - st_ts) > "7d", "lookback window is too small"` so that we maintain a healthy window and avoid having to add defensive code around weird ppss.yaml configurations.
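A minimal sketch of that guard, assuming st_ts and end_ts have already been resolved to epoch milliseconds (the names are illustrative, not the actual pdr-backend API):

```python
MIN_LOOKBACK_MS = 7 * 24 * 60 * 60 * 1000  # 7 days, per the suggestion above

def validate_lookback(st_ts_ms: int, end_ts_ms: int) -> None:
    # Reject tight windows like "1h ago <-> now" before the pipeline starts.
    assert end_ts_ms - st_ts_ms > MIN_LOOKBACK_MS, "lookback window is too small"
```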

Example: We often have problems with subgraphs going down and many APIs/external data providers can run into problems. Having too tight of a value in st_ts isn't advisable.

[Anchoring to last_ts from CSV/GQL]
The other way to deal with this is to anchor st_ts to the last_timestamp across csv/raw files, just like GQL does right now... which is, again, why I don't understand why this is failing...

If, in GQLDF, st_ts is greater than the head of the CSV, then the pipeline should resume from the head of the CSV... same with the table... in such a way that no gaps or duplicates are created.
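A sketch of that anchoring behaviour, with hypothetical names standing in for the real _calc_start_ut logic:

```python
from typing import Optional

def calc_start_ut(ppss_st_ut: int, last_saved_ut: Optional[int]) -> int:
    # Empty lake (first run): start from the ppss st_ts.
    if last_saved_ut is None:
        return ppss_st_ut
    # Later runs: resume from just after the newest record already saved
    # (the "head" of the CSV/table), so no gaps and no duplicates are created.
    return last_saved_ut + 1
```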


```diff
 for table in (
     TableRegistry().get_tables(self.record_config["gql_tables"]).values()
 ):
     # calculate start and end timestamps
-    st_ut = self._calc_start_ut(table)
+    st_ut = self._calc_start_ut(table, start_ut_ppss)
```
@idiom-bytes (Member) commented on this diff:

I think this makes sense, but I don't understand how it fixes the problem you described in the ticket.

Further, you didn't provide any test or other setup showing me how this fixes the problem.

@KatunaNorbert (Member, author) replied:

It fixes the problem by having a constant start timestamp across all the tables while the fetch process is running, instead of computing it at each table level, which causes different values.

I have not provided a test because it takes a lot of time, and we should validate the approach before implementing the solution end-to-end and then just ditching it.

```diff
-    from_timestamp = self.get_timestamp_values(
-        self.bronze_table_names, self.ppss.lake_ss.st_timestr
+    st_timestr = self._get_min_timestamp_values_from(
+        [NamedTable(tb) for tb in self.gql_data_factory.raw_table_names]
```
@idiom-bytes (Member) commented on this diff May 28, 2024:

In the case of the first run

Where there are no bronze tables, you should be using ppss.lake_ss.st_ts, because that's the exact timestamp to start from. This will end up being very close to, if not the same as, self._get_min_timestamp_values_from... but you should not use your function, because it's redundant/extra work; use ppss.lake_ss.st_ts as the starting point when the lake is empty (read: first run).

In the case of the second/third run

When you actually start experiencing the problem, you'll be many runs into the ETL (i.e. run number 3), and you're saying that the start time should be (in case you do not find the latest from bronze tables) the min_timestamp from raw_tables...


Clearly this doesn't make any sense and it's obviously not what you want to do.

Focus on a test that shows the current system not working

Please focus on creating a test that shows the problem manifesting itself.
Then prove the fix works by making the test pass.

@KatunaNorbert (Member, author) replied:

First run:
Using ppss.lake_ss.st_ts is the exact problem, because it can be either fixed or dynamic:

  • if the value is a fixed date, then there is no problem: all tables get the same start date
  • if the value is dynamic, like "18d ago", then the value is different whenever you use it and you end up with different start dates.
    Ex: GQL starts at 2024.05.19_11:00 -> st_date 2024.05.01_11:00
    ETL starts at 2024.05.19_11:10 -> st_date 2024.05.01_11:10
    -> 10 min gap between the start dates

Second/third run:
You got it wrong: the start date is read from ppss only for the first run, when there are no rows in the bronze tables; in all other cases it gets the start time from the bronze tables.
This was not changed. The only thing that changes is that on the first run, when there is no data in the bronze tables, it gets the start date from the raw tables instead of ppss, for the reason explained above (the ppss value can be dynamic).
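A sketch of the selection order described above, using hypothetical names rather than the actual pdr-backend functions:

```python
from typing import Optional

def resolve_etl_start_ut(
    last_bronze_ut: Optional[int],
    min_raw_ut: Optional[int],
    ppss_st_ut: int,
) -> int:
    # Runs 2, 3, ...: the bronze tables already have rows, so resume from there.
    if last_bronze_ut is not None:
        return last_bronze_ut + 1
    # First ETL run: anchor to the already-fetched raw tables instead of
    # re-parsing a possibly-dynamic ppss st_timestr.
    if min_raw_ut is not None:
        return min_raw_ut
    # Nothing in the lake at all: fall back to ppss.
    return ppss_st_ut
```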

@KatunaNorbert (Member, author) commented May 29, 2024

Added a test to show how ppss.lake_ss.st_timestr can be dynamic and have different values in the different places we use it.

@idiom-bytes (Member) commented May 29, 2024

Added a test to show how ppss.lake_ss.st_timestr can be dynamic and have different values in the different places we use it.

Thank you, I see the issue now.

TL;DR: "we need st_ts and end_ts to be consistent across the whole GQL + ETL execution"

[Example]
Basically, because we do not maintain a checkpoint, time shifts as the pipeline runs:

when gql_starts
now = 1

when etl_starts
now = 2

when etl_ends
now = 3

[Zoom Out]
It's the same problem that was happening in ETL, but now across the whole pipeline.

[Simple Solution]
Just resolve st_ts and end_ts into a fixed timestamp that then gets passed in throughout the whole flow... example...

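A minimal sketch of that flow; the .update(st_ut, fin_ut) signatures are illustrative, not the actual pdr-backend API:

```python
from datetime import datetime, timedelta, timezone

def run_lake_update(gql_data_factory, etl, lookback_days: int = 1) -> None:
    # Resolve the window ONCE, on the outside of GQL/ETL.
    fin_dt = datetime.now(timezone.utc)              # "now", fixed at command start
    st_dt = fin_dt - timedelta(days=lookback_days)   # "1 day ago", relative to the same instant

    st_ut = int(st_dt.timestamp() * 1000)            # epoch ms
    fin_ut = int(fin_dt.timestamp() * 1000)

    # Every downstream step receives the same fixed window, so it cannot
    # drift while the pipeline runs.
    gql_data_factory.update(st_ut, fin_ut)
    etl.update(st_ut, fin_ut)
```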

@KatunaNorbert (Member, author) commented May 29, 2024

[Simple Solution] Just resolve st_ts and end_ts into a fixed timestamp that then gets passed in throughout the whole flow... example...

What if I run lake raw update and then lake etl update, or drop etl tables and run etl update again?

@idiom-bytes (Member) commented May 29, 2024

@KatunaNorbert @calina-c

The pipeline interface expects you to give it a fixed st_ts + end_ts, and the system was modified in such a way that these parameters can change...

The problem isn't that the data pipeline doesn't support a moving checkpoint (it will, to a certain extent, and this will break too...); it's that things changed in such a way that the ppss configuration can break the pipeline.

All the parameters you mentioned above should work, as long as you don't get any errors...
st_ts = "1d ago"
end_ts = "now"

Overall, you're testing functionality that is an edge case / bad configuration...

DoD:

  • Figure out st_ts and end_ts before calling the pipeline. Then, call the pipeline with a fixed st_ts and end_ts, such that these parameters DO NOT CHANGE during the course of the ETL pipeline.
  • Add checks/validations/asserts that (end_ts - st_ts) > 1 day. Anything below this is a horrible configuration: if the subgraph goes down for a day, such a configuration fails.

@KatunaNorbert (Member, author) commented May 30, 2024

Overall, you're testing functionality that is an edge case / bad configuration...

To me it doesn't sound like an edge case when we have commands that can run only the GQL update and then the ETL update separately, and the same with drop, and you insisted on doing those steps while testing; now you are saying those are edge cases and we shouldn't test them...

If we modify/fix things to have fixed dates through the ETL update process, there are still going to be cases where it won't work, but it will work if we only do ETL updates (just raising expectations).

@idiom-bytes (Member) commented:

WRT the core issue, I'm going to create a separate PR so we can try to close this off.

WRT error on CLI and the discussion below, I'll create a separate ticket.

For me it doesn't sound like an edge case

Can you please be more specific as to why you don't think this is an edge case/bad configuration?

I'm arguing that we should not support anything less than a day. It doesn't even make sense for what our system is trying to do (predict the future).

Although the system will still work (because it will start from the last_record available > st_ts), trying to support anything below 1d lookback on ppss.yaml doesn't sound like a productive idea.

ppss.lake_ss.st_ts = "1d ago"
ppss.lake_ss.end_ts = "now"

@trentmc can you please provide some guidance?

@idiom-bytes (Member) commented May 30, 2024

I have now fixed it here.. simply by resolving the "natural language" date on the outside loop of GQL/ETL/others...
#1106


This forces a fixed date range to be used throughout the lifetime of the command.

I'll adjust this again when we have a thread/loop working... and I have created these two tickets for the future, for when I implement a thread/loop.
#1105
#1107

@calina-c
you could take #1105 now as it's agnostic of DuckDB

Closing this PR
