Fix dynamic ppss start date causes issues #1095
Conversation
Hi @KatunaNorbert, rather than "fixing" the problem w/ a big system-level change and no test...

[Requirements]
Example: Based on your update, I'm led to believe that this problem is happening both at the Raw + ETL table level, since you're changing both GQL & ETL code... However, I can't even understand the problem...
I think what you mean is that when you have ppss configured like this... you end up getting a "dynamic start timestamp". Thus, I need you to improve your ticket/example/test/coverage/proposed solution and provide a better test/example showing me how this problem is now solved.

[Overview]
Whether you use: ... both should resolve to the same checkpoint/logic/tables processed. No rows should be dropped, and no duplicate/extra rows should be created. "predictions" and "bronze_predictions" should maintain the same number of rows.

[Avoid tight lookback windows]
Preferably, we can add an `assert (end_ts - st_ts) > "7d", "lookback window is too small"` so that we maintain a healthy window and avoid having to add defensive code around weird ppss.yaml configurations. Example: we often have problems with subgraphs going down, and many APIs/external data providers can run into problems. Having too tight a value in st_ts isn't advisable.

[Anchoring to last_ts from CSV/GQL]
If in GQLDF st_ts is greater than the head of the CSV, then the pipeline should resume from the head of the CSV... same with tables... in such a way that no gaps or duplicates are created.
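The lookback guard suggested above could be sketched roughly like this (a minimal sketch; `check_lookback` and the exact 7-day threshold are illustrative assumptions, not project code):

```python
from datetime import datetime, timedelta

MIN_LOOKBACK = timedelta(days=7)  # assumed minimum healthy window

def check_lookback(st_ts: datetime, end_ts: datetime) -> None:
    # reject ppss configurations whose fetch window is too narrow
    assert (end_ts - st_ts) >= MIN_LOOKBACK, "lookback window is too small"

check_lookback(datetime(2024, 5, 1), datetime(2024, 5, 11))  # 10 days: ok
```

A window shorter than the threshold would raise `AssertionError` instead of letting defensive code paper over the bad configuration.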
```diff
  for table in (
      TableRegistry().get_tables(self.record_config["gql_tables"]).values()
  ):
      # calculate start and end timestamps
-     st_ut = self._calc_start_ut(table)
+     st_ut = self._calc_start_ut(table, start_ut_ppss)
```
I think this makes sense, but I don't understand how it fixes the problem you described in the ticket.
Further, you didn't provide any test or other setup showing me how this fixes the problem.
It fixes the problem by having a constant start timestamp across all the tables while the fetch process is running, instead of initializing it at each table level, which causes different values.
I have not provided a test because it takes a lot of time, and we should validate the approach before implementing the solution end-to-end and then just ditching it.
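The resolve-once-before-the-loop idea could look roughly like this (hypothetical helper names; `resolve_start_ut` and `fetch_all` are sketches, not the actual project functions):

```python
import time

def resolve_start_ut(st_timestr: str, now_ut_ms: int) -> int:
    # resolve a possibly-dynamic string like "18d" into a fixed UTC ms timestamp
    if st_timestr.endswith("d"):
        return now_ut_ms - int(st_timestr[:-1]) * 86_400_000
    return int(st_timestr)  # assume an absolute ms timestamp otherwise

def fetch_all(tables, st_timestr: str) -> dict:
    # resolve ONCE, outside the per-table loop, so every table shares one start
    start_ut_ppss = resolve_start_ut(st_timestr, int(time.time() * 1000))
    return {name: start_ut_ppss for name in tables}

starts = fetch_all(["predictions", "subscriptions"], "18d")
assert len(set(starts.values())) == 1  # every table got the same start value
```

If instead each table called `resolve_start_ut` itself, a dynamic value like "18d" would resolve to a slightly different timestamp per table.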
```diff
- from_timestamp = self.get_timestamp_values(
-     self.bronze_table_names, self.ppss.lake_ss.st_timestr
+ st_timestr = self._get_min_timestamp_values_from(
+     [NamedTable(tb) for tb in self.gql_data_factory.raw_table_names]
```
In the case of the first run
(where there are no bronze tables), you should be using ppss.lake_ss.st_ts
because that's the exact timestamp to start. This will end up being very close to, if not the same as, self._get_min_timestamp_values_from
... but you should not use your function, because it's redundant/extra work; you should use ppss.lake_ss.st_ts
as the starting point when the lake is empty (read: first run).
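The selection rule being argued for here could be sketched as follows (hypothetical function; it assumes the bronze max timestamp is `None` when the lake is empty):

```python
from typing import Optional

def choose_start_ts(bronze_max_ts: Optional[int], ppss_st_ts: int) -> int:
    # first run: lake is empty, so start exactly at ppss.lake_ss.st_ts
    if bronze_max_ts is None:
        return ppss_st_ts
    # later runs: resume from the newest bronze row, ignoring ppss
    return bronze_max_ts

assert choose_start_ts(None, 1_000) == 1_000   # first run uses ppss
assert choose_start_ts(5_000, 1_000) == 5_000  # later runs resume from bronze
```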
In the case of the second/third run
When you actually start experiencing the problem, you'll be many runs into the ETL (e.g. run number 3), and you're saying that the start time should be (in case you do not find the latest from the bronze tables) the min_timestamp from the raw tables...
Clearly this doesn't make any sense, and it's obviously not what you want to do.
Focus on a test that shows the current system not working
Please focus on creating a test that shows the problem manifesting itself.
Then prove the fix works by making the test pass.
First run:
Using ppss.lake_ss.st_ts
is the exact problem, because it can be either fixed or dynamic:
- if the value is a fixed date, there is no problem; all tables get the same start date
- if the value is dynamic, like "18d ago", then the resolved value is different whenever you use it, and you end up with different start dates.
Ex: gql starts at 2024.05.19_11:00 -> st_date 2024.05.01_11:00
etl starts at 2024.05.19_11:10 -> st_date 2024.05.01_11:10
-> 10min gap between the start dates
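A toy resolver makes the gap in this example concrete (a sketch; `resolve_relative` is illustrative, not the project's actual date parser):

```python
def resolve_relative(st_timestr: str, now_ms: int) -> int:
    # "18d" means 18 days before *now*, so the result shifts with call time
    return now_ms - int(st_timestr.rstrip("d")) * 86_400_000

gql_now = 1_716_116_400_000          # GQL step runs at 11:00
etl_now = gql_now + 10 * 60 * 1000   # ETL step runs 10 minutes later

gap = resolve_relative("18d", etl_now) - resolve_relative("18d", gql_now)
assert gap == 600_000  # the two resolved start dates differ by 10 minutes
```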
Second/third:
You got it wrong: the start date is read from ppss only for the first run, when there are no rows in the bronze tables; in all other cases it gets the start time from the bronze tables.
This was not changed. The only thing that changed is that on the first run, when there is no data in the bronze tables, it gets the start date from the raw tables instead of ppss, for the reason explained above (the ppss value can be dynamic).
Added a test to show how ppss.lake_ss.st_timestr could be dynamic and have different values in the different places we use it.
TLDR; "we need st_ts and end_ts to be consistent across the whole GQL + ETL execution"
[Example] when gql_starts... when etl_starts... when etl_ends...
[Zoom Out]
[Simple Solution]
What if I run ...
The pipeline interface expects you to give it a fixed st_ts + end_ts, and the system was modified in such a way that these parameters can change... The problem isn't that the data pipeline doesn't support a moving checkpoint (it will, to a certain extent, and this will break too...); it's that things changed in such a way that ppss configuration can break the pipeline. All the parameters you mentioned above should work as long as you don't get any errors... Overall, you're testing functionality that is an edge case/bad configuration... DoD:
To me it doesn't sound like an edge case when we have commands that can run only the GQL updates and then the ETL update separately, and the same with drop, and you insisted on doing those steps while testing; now you are saying those are edge cases and we shouldn't test them. If we modify/fix to have fixed dates through the ETL update process, there are still going to be cases where it won't work, but it will if we only do ETL updates (just raising expectations).
WRT the core issue, I'm going to create a separate PR so we can try to close this off. WRT error on CLI and the discussion below, I'll create a separate ticket.
Can you please be more specific as to why you don't think this is an edge case/bad configuration? I'm arguing that we should not support anything less than a day. It doesn't even make sense for what our system is trying to do (predict the future). Although the system will still work (because it will start from the last_record available > st_ts), trying to support anything below 1d lookback on ppss.yaml doesn't sound like a productive idea.
@trentmc can you please provide some guidance?
I have now fixed it here... simply by resolving the "natural language" date in the outer loop of GQL/ETL/others... This forces a fixed date range to be used throughout the lifetime of the command. I'll adjust this again when we have a thread/loop working... and have created these two tickets for the future, for when I implement a thread/loop. @calina-c Closing this PR
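The resolve-once-at-the-top pattern described here could look like this (a sketch with assumed names; the real command wiring and resolver differ):

```python
import time

def resolve(timestr: str, now_ms: int) -> int:
    # toy resolver: "Nd" -> N days before now_ms; otherwise an absolute ms value
    if timestr.endswith("d"):
        return now_ms - int(timestr[:-1]) * 86_400_000
    return int(timestr)

def run_lake_update(st_timestr: str, fin_timestr: str):
    # resolve natural-language dates once, at the top of the command,
    # then hand the same fixed (st, fin) range to every stage
    now_ms = int(time.time() * 1000)
    window = (resolve(st_timestr, now_ms), resolve(fin_timestr, now_ms))
    gql_window = window  # GQL update uses the fixed range
    etl_window = window  # ETL update uses the identical range
    return gql_window, etl_window

g, e = run_lake_update("18d", "0d")
assert g == e  # both stages see the same fixed window
```

Because the range is computed once per command invocation, a later thread/loop only needs to re-resolve at the top of each iteration rather than at each stage.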
Fixes #1094.
DoD: