
Fix dynamic ppss start date causes issues #1095

Conversation

@KatunaNorbert (Member) commented May 28, 2024

Fixes #1094.

DoD:

  • Figure out st_ts and end_ts before calling the pipeline. Then, call the pipeline with a fixed st_ts and end_ts, such that these parameters DO NOT CHANGE during the course of the ETL pipeline.
  • Add checks/validations/asserts that (end_ts - st_ts) > 1 day. Anything below this is a horrible configuration: if the subgraph goes down for a day, such a configuration fails.

@idiom-bytes (Member) commented May 28, 2024

Hi @KatunaNorbert,

Rather than "fixing" the problem with a big system-level change and no test...

[Requirements]
I want you to please create a test that shows me why/how the problem is happening.
Please use with/patch and the fixture from test_etl.py to try and isolate the issue.

Example: Based on your update, I'm led to believe that this problem is happening both at the Raw + ETL Table level since you're changing both GQL & ETL code...

However, I can't even understand the problem...

Because lake_ss start timestamp value could be dynamic by now

I think what you mean is that when you have ppss configured like this...
lake_ss.st_ts = "1 day ago"
lake_ss.end_ts = "now"

You end up getting a "dynamic start timestamp". Thus, I need you to improve your ticket/example/test coverage/proposed solution and provide a better test/example showing me how this problem is solved.
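As a standalone illustration (toy code, not pdr-backend's actual timestr resolver), this is how a relative st_timestr drifts when it gets re-parsed at different points of the same run:

```python
import time
from datetime import datetime, timedelta, timezone

def parse_timestr(timestr: str) -> int:
    """Toy resolver: "1 day ago" -> epoch ms *at the moment of the call*."""
    if timestr.endswith("ago"):
        n_days = int(timestr.split()[0])
        dt = datetime.now(timezone.utc) - timedelta(days=n_days)
    else:
        dt = datetime.strptime(timestr, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

st_at_gql = parse_timestr("1 day ago")   # resolved when the GQL fetch starts
time.sleep(2)                            # ...pipeline does some work...
st_at_etl = parse_timestr("1 day ago")   # resolved again when the ETL step starts

# The two "start of window" values differ by the elapsed wall-clock time,
# so the raw and bronze tables end up covering slightly different windows.
assert st_at_etl > st_at_gql
```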

[Overview]
In my mental model, the functions we use in GQL and ETL to calculate st_ts and end_ts should work.
@ gql_data_factory.py - LoC 291, 296
@ etl.py - LoC 213

Whether you use:
st_ts: "05-01-2024"
st_ts: "28d ago"

Both should resolve to the same checkpoint/logic/tables processed. No rows should be dropped, and no duplicate/extra rows should be created.

"predictions" and "bronze_predictions" should maintain the same number of rows.

[Avoid tight lookback windows]
As long as we don't make the st_ts <-> end_ts window too small ("1h ago" <-> "now"), I believe the pipeline should keep running and maintain all tables up-to-date.

Preferably, we can put an `assert (end_ts - st_ts) > "7d", "lookback window is too small"` so that we maintain a healthy window and avoid having to add defensive code around weird ppss.yaml configurations.
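A minimal sketch of that guard, assuming st_ts and end_ts have already been resolved to epoch milliseconds (the names are illustrative, not the actual pdr-backend API):

```python
MIN_LOOKBACK_MS = 7 * 24 * 60 * 60 * 1000  # 7 days, per the suggestion above

def validate_lookback(st_ts_ms: int, end_ts_ms: int) -> None:
    # Reject tight windows like "1h ago <-> now" before the pipeline starts.
    assert end_ts_ms - st_ts_ms > MIN_LOOKBACK_MS, "lookback window is too small"
```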

Example: We often have problems with subgraphs going down and many APIs/external data providers can run into problems. Having too tight of a value in st_ts isn't advisable.

[Anchoring to last_ts from CSV/GQL]
The other way to deal with this is to anchor st_ts to the last_timestamp across csv/raw files, just like GQL does right now... which is, again, why I don't understand why this is failing...

If, in GQLDF, st_ts is greater than the head of the CSV, then the pipeline should resume from the head of the CSV... same with the table... in such a way that no gaps or duplicates are created.
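A sketch of that anchoring behaviour, with hypothetical names standing in for the real _calc_start_ut logic:

```python
from typing import Optional

def calc_start_ut(ppss_st_ut: int, last_saved_ut: Optional[int]) -> int:
    # Empty lake (first run): start from the ppss st_ts.
    if last_saved_ut is None:
        return ppss_st_ut
    # Later runs: resume from just after the newest record already saved
    # (the "head" of the CSV/table), so no gaps and no duplicates are created.
    return last_saved_ut + 1
```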


```diff
 for table in (
     TableRegistry().get_tables(self.record_config["gql_tables"]).values()
 ):
     # calculate start and end timestamps
-    st_ut = self._calc_start_ut(table)
+    st_ut = self._calc_start_ut(table, start_ut_ppss)
```
@idiom-bytes (Member) commented on this diff:

I think this makes sense, but I don't understand how it fixes the problem you described in the ticket.

Further, you didn't provide any test or other setup showing me how this fixes the problem.

@KatunaNorbert (Member, author) replied:

It fixes the problem by having a constant start timestamp across all the tables while the fetch process is running, instead of computing it at each table level, which causes different values.

I have not provided a test because it takes a lot of time, and we should validate the approach before implementing the solution end-to-end and then just ditching it.

```diff
-    from_timestamp = self.get_timestamp_values(
-        self.bronze_table_names, self.ppss.lake_ss.st_timestr
+    st_timestr = self._get_min_timestamp_values_from(
+        [NamedTable(tb) for tb in self.gql_data_factory.raw_table_names]
```
@idiom-bytes (Member) commented on this diff May 28, 2024:

In the case of the first run

Where there are no bronze tables, you should be using ppss.lake_ss.st_ts, because that's the exact timestamp to start from. This will end up being very close to, if not the same as, self._get_min_timestamp_values_from... but you should not use your function, because it's redundant/extra work; use ppss.lake_ss.st_ts as the starting point when the lake is empty (read: first run).

In the case of the second/third run

When you actually start experiencing the problem, you'll be many runs into the ETL (i.e. run number 3), and you're saying that the start time should be (in case you do not find the latest from bronze tables) the min_timestamp from raw_tables...


Clearly this doesn't make any sense and it's obviously not what you want to do.

Focus on a test that shows the current system not working

Please focus on creating a test that shows the problem manifesting itself.
Then prove the fix works by making the test pass.

@KatunaNorbert (Member, author) replied:

First run:
Using ppss.lake_ss.st_ts is the exact problem, because it can be either fixed or dynamic:

  • if the value is a fixed date, then there is no problem: all tables get the same start date
  • if the value is dynamic, like "18d ago", then the value is different whenever you use it and you end up with different start dates.
    Ex: GQL starts at 2024.05.19_11:00 -> st_date 2024.05.01_11:00
    ETL starts at 2024.05.19_11:10 -> st_date 2024.05.01_11:10
    -> 10 min gap between the start dates

Second/third run:
You got it wrong: the start date is read from ppss only for the first run, when there are no rows in the bronze tables; in all other cases it gets the start time from the bronze tables.
This was not changed. The only thing that changes is that on the first run, when there is no data in the bronze tables, it gets the start date from the raw tables instead of ppss, for the reason explained above (the ppss value can be dynamic).
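A sketch of the selection order described above, using hypothetical names rather than the actual pdr-backend functions:

```python
from typing import Optional

def resolve_etl_start_ut(
    last_bronze_ut: Optional[int],
    min_raw_ut: Optional[int],
    ppss_st_ut: int,
) -> int:
    # Runs 2, 3, ...: the bronze tables already have rows, so resume from there.
    if last_bronze_ut is not None:
        return last_bronze_ut + 1
    # First ETL run: anchor to the already-fetched raw tables instead of
    # re-parsing a possibly-dynamic ppss st_timestr.
    if min_raw_ut is not None:
        return min_raw_ut
    # Nothing in the lake at all: fall back to ppss.
    return ppss_st_ut
```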

@KatunaNorbert (Member, author) commented May 29, 2024

Added a test to show how ppss.lake_ss.st_timestr can be dynamic and have different values in the different places we use it.

@idiom-bytes (Member) commented May 29, 2024

Added a test to show how ppss.lake_ss.st_timestr can be dynamic and have different values in the different places we use it.

Thank you, I see the issue now.

TL;DR: "we need st_ts and end_ts to be consistent across the whole GQL + ETL execution"

[Example]
Basically, because we do not maintain a checkpoint, time shifts as the pipeline runs:

when gql_starts
now = 1

when etl_starts
now = 2

when etl_ends
now = 3

[Zoom Out]
It's the same problem that was happening in ETL, but now across the whole pipeline.

[Simple Solution]
Just resolve st_ts and end_ts into a fixed timestamp that then gets passed in throughout the whole flow... example...

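A minimal sketch of that flow; the .update(st_ut, fin_ut) signatures are illustrative, not the actual pdr-backend API:

```python
from datetime import datetime, timedelta, timezone

def run_lake_update(gql_data_factory, etl, lookback_days: int = 1) -> None:
    # Resolve the window ONCE, on the outside of GQL/ETL.
    fin_dt = datetime.now(timezone.utc)              # "now", fixed at command start
    st_dt = fin_dt - timedelta(days=lookback_days)   # "1 day ago", relative to the same instant

    st_ut = int(st_dt.timestamp() * 1000)            # epoch ms
    fin_ut = int(fin_dt.timestamp() * 1000)

    # Every downstream step receives the same fixed window, so it cannot
    # drift while the pipeline runs.
    gql_data_factory.update(st_ut, fin_ut)
    etl.update(st_ut, fin_ut)
```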

@KatunaNorbert (Member, author) commented May 29, 2024

[Simple Solution] Just resolve st_ts and end_ts into a fixed timestamp that then gets passed in throughout the whole flow... example...

What if I run lake raw update and then lake etl update, or drop etl tables and run etl update again?

@idiom-bytes (Member) commented May 29, 2024

@KatunaNorbert @calina-c

The pipeline interface expects you to give it a fixed st_ts + end_ts, and the system was modified in such a way that these parameters can change...

The problem isn't that the data pipeline doesn't support a moving checkpoint (it will, to a certain extent, and this will break too...); it's that things changed in such a way that the ppss configuration can break the pipeline.

All the parameters you mentioned above should work, as long as you don't get any errors...
st_ts = "1d ago"
end_ts = "now"

Overall, you're testing functionality that is an edge case / bad configuration...

DoD:

  • Figure out st_ts and end_ts before calling the pipeline. Then, call the pipeline with a fixed st_ts and end_ts, such that these parameters DO NOT CHANGE during the course of the ETL pipeline.
  • Add checks/validations/asserts that (end_ts - st_ts) > 1 day. Anything below this is a horrible configuration: if the subgraph goes down for a day, such a configuration fails.

@KatunaNorbert (Member, author) commented May 30, 2024

Overall, you're testing functionality that is an edge case / bad configuration...

To me it doesn't sound like an edge case when we have commands that can run only the GQL update and then the ETL update separately, and the same with drop, and you insisted on doing those steps while testing; now you are saying those are edge cases and we shouldn't test them...

If we modify/fix things to have fixed dates through the ETL update process, there are still going to be cases where it won't work, but it will work if we only do ETL updates (just raising expectations).

@idiom-bytes (Member) commented:

WRT the core issue, I'm going to create a separate PR so we can try to close this off.

WRT error on CLI and the discussion below, I'll create a separate ticket.

For me it doesn't sound like an edge case

Can you please be more specific as to why you don't think this is an edge case/bad configuration?

I'm arguing that we should not support anything less than a day. It doesn't even make sense for what our system is trying to do (predict the future).

Although the system will still work (because it will start from the last_record available > st_ts), trying to support anything below 1d lookback on ppss.yaml doesn't sound like a productive idea.

ppss.lake_ss.st_ts = "1d ago"
ppss.lake_ss.end_ts = "now"

@trentmc can you please provide some guidance?

@idiom-bytes (Member) commented May 30, 2024

I have now fixed it here.. simply by resolving the "natural language" date on the outside loop of GQL/ETL/others...
#1106


This forces a fixed date range to be used throughout the lifetime of the command.

I'll adjust this again when we have a thread/loop working... and I have created these two tickets for the future, for when I implement a thread/loop.
#1105
#1107

@calina-c
you could take #1105 now as it's agnostic of DuckDB

Closing this PR
