[Lake][Config] PredictoorETL is handling st_ts and end_ts dates correctly. #1086

Closed
2 tasks
idiom-bytes opened this issue May 23, 2024 · 5 comments
@idiom-bytes
Member

Background / motivation

This originates from scoping down & stabilizing the DuckDB/Raw/ETL data pipeline.
#1077

The problem exists because lake_ss.st_ts and lake_ss.end_ts are routinely set to a relative time window, which matches how OHLCV data is typically used (with a lookback).

Example Time Window
st_ts: "1 month ago"
end_ts: "now"

If we want to solve for having a relative date in ppss.yaml (rather than use a fixed date for st_ts) then I propose we have different lake_ss strategies, such that we can separate the concern of the lake (to grow as big as possible) from the subsystems that are trying to consume/process/build off this data.

Challenge / Problem

The lake (and likely OHLCV) benefits from being "greedy": always growing to hold as much data as possible. So it would be silly for, say, a 1-month Time Window to delete older data.

It would be best to separate these concerns into Lake<->System.

Lake
|- Pond
|- Pond

[Lake]
Is always greedy, trying to grow as much as possible.

  • It should start from a fixed date (e.g. 01-01-2023)
  • It should end on the latest possible date (e.g. now)

[Pond]
Is a filter of Lake, trying to process whatever data from lake it's responsible for.

  • It could start from a relative date (e.g. 1 month ago)
  • It could end on a relative date (e.g. 1 day ago)

We generally do not want a lake with a moving tail.
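To make the Lake/Pond split concrete, here is a minimal Python sketch. The `Row` type and the `pond` function are hypothetical names for illustration, not part of the existing codebase; the point is only that a pond is a read-only, relative-date filter over a lake that never shrinks.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Row:
    """Hypothetical lake record: a UTC timestamp plus a payload."""
    ts: datetime
    value: float


def pond(lake_rows, now, start_ago, end_ago=timedelta(0)):
    """Return the rows in [now - start_ago, now - end_ago].

    The lake itself is never modified; the pond is only a view.
    """
    lo, hi = now - start_ago, now - end_ago
    return [r for r in lake_rows if lo <= r.ts <= hi]


# Lake: one row per day from the fixed start date 2023-01-01 up to "now".
now = datetime(2024, 5, 23, tzinfo=timezone.utc)
lake_start = datetime(2023, 1, 1, tzinfo=timezone.utc)
n_days = (now - lake_start).days + 1
lake = [Row(lake_start + timedelta(days=d), float(d)) for d in range(n_days)]

# Pond: "1 month ago" to "1 day ago" (assumption: 1 month == 30 days).
recent = pond(lake, now, start_ago=timedelta(days=30), end_ago=timedelta(days=1))
```

The pond can move its window on every run while the lake keeps its fixed start, so there is no moving tail on the lake.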

Proposal A - Broad Lake - Narrow Pond - Improving ppss.yaml

Update lake_ss to be greedy.

  • No start relative dates
  • Fixed start dates
  • Relative end_dates are ok

Let model_ss use filtering

  • OHLCV/Model AI can have a relative start date
  • This data can be sampled from the larger lake

Example:

lake_ss:
  st_ts: "01-01-2023"
  end_ts: "now"
pdr_etl_ss:
  st_ts: "01-01-2023"
  end_ts: "now"
ai_model_ss:
  st_ts: "last 1 month"
  end_ts: "now"
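
A rough sketch of how such a config could be resolved into fixed windows, in Python. Everything here is an assumption for illustration: `resolve_ts`, `resolve_windows` and `consumers_within_lake` are hypothetical helpers, fixed dates are read as DD-MM-YYYY at midnight UTC, a month is taken as 30 days, and only the literal forms used in the example above are handled.

```python
import re
from datetime import datetime, timedelta, timezone

_UNIT_DAYS = {"day": 1, "week": 7, "month": 30}  # assumption: 1 month == 30 days


def resolve_ts(text, now):
    """Turn a config string into a fixed UTC datetime (hypothetical helper)."""
    text = text.strip().lower()
    if text == "now":
        return now
    m = re.fullmatch(r"last (\d+) (day|week|month)s?", text)
    if m:
        return now - timedelta(days=int(m.group(1)) * _UNIT_DAYS[m.group(2)])
    # Assumption: fixed dates are written DD-MM-YYYY and mean midnight UTC.
    return datetime.strptime(text, "%d-%m-%Y").replace(tzinfo=timezone.utc)


def resolve_windows(config, now):
    """Resolve every subsystem's (st_ts, end_ts) against the same 'now'."""
    return {
        name: (resolve_ts(ss["st_ts"], now), resolve_ts(ss["end_ts"], now))
        for name, ss in config.items()
    }


def consumers_within_lake(windows, lake_key="lake_ss"):
    """True if every consumer window starts no earlier than the lake does."""
    lake_st, _ = windows[lake_key]
    return all(st >= lake_st for name, (st, _) in windows.items() if name != lake_key)


config = {
    "lake_ss": {"st_ts": "01-01-2023", "end_ts": "now"},
    "pdr_etl_ss": {"st_ts": "01-01-2023", "end_ts": "now"},
    "ai_model_ss": {"st_ts": "last 1 month", "end_ts": "now"},
}
now = datetime(2024, 5, 23, tzinfo=timezone.utc)
windows = resolve_windows(config, now)
```

The `consumers_within_lake` check is one way to enforce "broad lake, narrow pond": a consumer asking for data older than the lake's fixed start would be flagged rather than silently returning less than expected.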

TODOs / DoD

  • Update ppss.yaml to support different lake/data strategies
  • lake_ss is responsible for owning/growing the lake
  • pdr_etl_ss and ai_model_ss then consume/use a subset of the lake

Tasks:

  • Update yaml to support pdr_etl vs. ai_model ss
  • Update lake to fill up based on top_level_ss rules
@idiom-bytes idiom-bytes added the Type: Enhancement New feature or request label May 23, 2024
@idiom-bytes idiom-bytes changed the title [Lake][Config] Lake is handling fixed and relative dates correctly. [Lake][Config] PredictoorETL is handling st_ts and end_ts dates correctly. May 23, 2024
@trentmc
Member

trentmc commented May 24, 2024

If an absolute start date is given in the yaml file, and there are already OHLCV csv files with earlier dates, the lake doesn't delete them. Why would it?

You can think of start and end date values as saying "I want at least this data in the lake". They are instructions on what extra data, if any, to gather. They are explicitly NOT saying "I want only this data in the lake, therefore delete anything outside that range".

This carries over to relative dates too.

Therefore you don't need to delete old data. That would be overkill and over-engineering. KISS.
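
A small Python sketch of that "at least this data" reading. `missing_ranges` is a hypothetical helper, timestamps are plain integers (e.g. ms since epoch), and the existing data is assumed to be one contiguous span. It only ever returns ranges to fetch and has no delete path at all.

```python
def missing_ranges(have, want):
    """Return the (start, end) ranges that still need fetching.

    have: (start, end) already stored, or None if the store is empty.
    want: (start, end) requested by the config.
    Stored data outside `want` is simply left alone.
    """
    if have is None:
        return [want]
    have_st, have_end = have
    want_st, want_end = want
    gaps = []
    if want_st < have_st:
        # Requested window reaches further back than what is stored.
        gaps.append((want_st, min(want_end, have_st)))
    if want_end > have_end:
        # Requested window reaches further forward than what is stored.
        gaps.append((max(want_st, have_end), want_end))
    return gaps


stored = (100, 200)
```

With a later relative start date, `missing_ranges(stored, (150, 300))` only asks for the new tail, and the older stored span from 100 to 150 stays where it is.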

@trentmc
Member

trentmc commented May 24, 2024

(Therefore we should keep relative dates for start date too. It helps ux.)

@KatunaNorbert
Member

It makes sense to me, and it has also crossed my mind before to have different config values for different data types inside the lake. This would make fetching and storing more efficient and would also help with overall configuration: if you want to experiment with sim or predictions you update one specific configuration inside the lake, and if you want to play with analytics you change another parameter. That is much easier to keep track of than modifying a single st_ts.

Related to the more specific ETL + GQL issue, I think we can make a minor change where "now" is read once at the start of the fetching process and kept constant until the fetch has ended. After the first data fetch has gone through, GQL looks at the last saved data timestamp when fetching new values, so there shouldn't be issues.
I created an issue and a fix for the proposed solution here: #1095
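
For illustration only (this is not the code from #1095), the idea could look roughly like this in Python. `run_fetch`, `fetch_window`, the table names and the injectable `clock` are all hypothetical; timestamps are integer milliseconds.

```python
def fetch_window(st_ts_ms, last_saved_ts_ms, now_ms):
    """Pick the window for one table: resume after the last saved row if any."""
    start = st_ts_ms if last_saved_ts_ms is None else last_saved_ts_ms + 1
    return start, now_ms


def run_fetch(last_saved_by_table, st_ts_ms, clock):
    """Plan one fetch run; `clock` is called exactly once, at the start."""
    now_ms = clock()  # frozen for the whole run, so every table shares one end_ts
    return {
        table: fetch_window(st_ts_ms, last_saved, now_ms)
        for table, last_saved in last_saved_by_table.items()
    }


# A clock that advances on every call, to show that the run reads it only once.
ticks = iter(range(1_000, 2_000))
plan = run_fetch(
    {"table_a": None, "table_b": 500},  # hypothetical table names
    st_ts_ms=100,
    clock=lambda: next(ticks),
)
```

Because the clock is read once, both tables end at the same timestamp even though wall-clock time keeps moving while the fetch runs.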

@idiom-bytes
Member Author

idiom-bytes commented May 31, 2024

thank you @KatunaNorbert! 👍
please review when possible

I have implemented the immediate solution I'm looking for here #1106 and closed #1095

this addresses the issue in one place, across all lake commands, so that GQL + ETL work correctly as intended.

@idiom-bytes idiom-bytes self-assigned this Jun 4, 2024
@idiom-bytes
Member Author

I have completed all the work required to address this for the ETL build. st_ts and end_ts are working as I expect and I believe I have a way to resolve this for my core requirements.

  • Dates are all showing up relative to UTC
  • Fuzzy dates are being temporarily cast to fixed dates while the pipeline runs
  • st_ts and end_ts look stable and working as expected

I'm closing this ticket as all dependencies have been addressed.


3 participants