[Lake, UX] Lake supports >1 writers, incl predictoor bots and >1 pdr lakes #564

Open
6 tasks
trentmc opened this issue Jan 20, 2024 · 2 comments
Labels
Type: Enhancement New feature or request

Comments

@trentmc
Member

trentmc commented Jan 20, 2024

Background / motivation

Approach 0: each app writes the lake in its own process. The predictoor bot updates the data lake, then reads from it, in time for making predictions. Same for other apps in pdr-backend.

Approach 1: separate lake, started separately

  • User starts a separate pdr lake process, which constantly writes to the data lake
  • Predictoor bot reads from the lake, but does not write (for safety). Same for other apps.
  • This was the idea when we first conceived of the lake.

But we can do better yet, by leveraging the database concept of "locking", which enables >1 writers without hurting DB safety. Writers must handle contention due to locks, e.g. by waiting.

Approach 2: allow >1 writers.

  • We could have >1 lake processes / threads, predictoor bots, or other apps.
  • Support three flows:
    • Flow 1: quickstart: start pdr lake inside the app. E.g. a user starts one pdr predictoor process (and nothing else). The predictoor bot detects whether a lake process is running, and starts one if needed.
    • Flow 2: power-predictoor usage: start pdr lake separately. E.g. a user starts pdr lake, then 20 pdr predictoor processes, one for each feed to predict.
    • Flow 3: power-lake usage: >1 lake processes / threads filling complementary parts of the lake (different pairs, different subgraph queries). E.g. a user starts 1 pdr lake process, and it starts >1 threads. E.g. a user starts 1 process, then later a different one with different goals. E.g. >1 users start different processes.
  • Benefits: (a) more convenient: users don't need to kick off the lake process themselves; (b) faster: parallel fill; (c) more flexible: users (or predictoor bots) can start more lake processes without worry.

Approach 2 is the endgame. Its benefits over Approach 1 are immense, let alone over Approach 0.

Q: Should we go from 0 -> 1 -> 2, or 0->2 directly?

  • I (Trent) recommend going 0 -> 2 directly, because of the big benefits.
  • Doing 1 in between would force users to change behavior. (And it's extra effort for us overall: much of the code we'd write for 1 would be thrown away for 2.)

TODOs

  • Locking core: Update lake to support "locking" concept. Such that I could run >1 different pdr lake processes against the same feed, and they wouldn't fight with each other
  • Parallel fill: Update lake to run >1 threads within a single pdr lake process, 1 thread per ohlcv pair or subgraph feed
  • Update predictoor bot: detect whether a lake process is running, and start one if needed.
  • Similarly, update xpmt_engine (nee sim_engine) flow
  • Similarly, update analytics apps flows
  • Ensure READMEs are all updated accordingly. predictoor.md and trader.md should teach the user about how to run pdr lake separately (at the end of the README)
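The "Parallel fill" TODO (1 thread per ohlcv pair or subgraph feed) could look roughly like this; `fill_pair` is a stand-in for the real fetch-and-write logic, and all names here are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def fill_pair(pair: str) -> tuple:
    """Stand-in for fetching one pair's ohlcv data and writing its lake slice."""
    return (pair, "filled")

def parallel_fill(pairs: list) -> dict:
    """One worker per pair, per the 'Parallel fill' TODO above.
    Assumes per-pair slices are disjoint, so workers don't contend."""
    with ThreadPoolExecutor(max_workers=len(pairs)) as ex:
        return dict(ex.map(fill_pair, pairs))
```

With the locking core in place, each worker's writes would go through the lock-aware write path, so even overlapping slices stay safe.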
@trentmc trentmc added the Type: Enhancement New feature or request label Jan 20, 2024
@trentmc trentmc changed the title [Lake, UX] Separate the process for pr [Lake, UX] Lake supports >1 writers, incl predictoor bots and >1 pdr lakes Jan 20, 2024
@idiom-bytes
Member

idiom-bytes commented Jan 23, 2024

I agree in general that all services (including different agents/bots) would benefit from having a lake that's just up-to-date. And having a process that's solely-responsible for doing this is the way forward.

I think there might be other approaches, like "swapping tables" or updating a pointer to the latest table, that might be more productive to implement than locking.

What I originally considered was building a base table.py object that would abstract the schema, return the df, point to a file, etc. The basic structure can be found in table_pdr_predictions and table_pdr_subscriptions. Anything that reads from the lake would do so through the Table() interface, not DataFactory(). This way, DataFactories operate on their own, updating the lake, while components/users access it via the interface.
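A rough sketch of that read-side Table() interface (class names, schema, and the parquet file layout are illustrative assumptions, not the real table_pdr_predictions definitions):

```python
from abc import ABC
from pathlib import Path

class Table(ABC):
    """Hypothetical read-only interface to one lake table.
    Readers go through this; only DataFactory writes the lake."""
    table_name: str  # set by each concrete subclass
    schema: dict     # column name -> type, set by each concrete subclass

    def __init__(self, lake_dir: str):
        self.lake_dir = Path(lake_dir)

    def file_path(self) -> Path:
        """Where this table's current data lives."""
        return self.lake_dir / f"{self.table_name}.parquet"

class PdrPredictionsTable(Table):
    table_name = "pdr_predictions"
    schema = {"timestamp": int, "predvalue": float}
```

The "swapping tables" idea then becomes: DataFactory writes a fresh file and atomically repoints `file_path()` (e.g. via rename), so readers never see a half-written table.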

@idiom-bytes
Member

idiom-bytes commented Jun 25, 2024

DuckDB only lets you have one writer process at a time holding the db writer connection. Within that process, you can have multiple threads operating on it. So we should make the duckdb "container/process/vm" as big as possible.

There is now a task (#1107) for making sure that Lake/ETL has an "update process" that sits there indefinitely, looping and updating the lake.

Forward Looking:

  • We can have requests/queries/etc. pushed to a "duckdb writer/service" via a simple API
  • Solving for "Approach 2" / multiple writer processes would entail a clustered db (i.e. ClickHouse)
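The "push requests to a duckdb writer/service via a simple API" idea could be sketched as a single writer thread draining a queue. Here `execute_fn` stands in for the real duckdb connection's execute; the class and method names are hypothetical:

```python
import queue
import threading

class WriterService:
    """One writer thread owns the single db write connection;
    other components push write requests through a queue (the 'simple API')."""

    def __init__(self, execute_fn):
        self._execute = execute_fn  # e.g. a duckdb connection's execute
        self._q = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, sql: str, params: tuple = ()) -> threading.Event:
        """Enqueue one write; returns an Event set once it has been applied."""
        done = threading.Event()
        self._q.put((sql, params, done))
        return done

    def _run(self):
        while True:
            item = self._q.get()
            if item is None:  # close() sentinel
                return
            sql, params, done = item
            self._execute(sql, params)
            done.set()

    def close(self):
        self._q.put(None)
        self._thread.join()
```

This keeps DuckDB's one-writer constraint intact while still letting many bots/apps issue writes concurrently; going beyond one writer process is where a clustered db would come in.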

Projects
None yet
Development

No branches or pull requests

2 participants