[Lake, UX] Lake supports >1 writers, incl predictoor bots and >1 pdr lakes #564

Open
6 tasks
trentmc opened this issue Jan 20, 2024 · 2 comments
Labels
Type: Enhancement New feature or request

Comments

@trentmc
Member

trentmc commented Jan 20, 2024

Background / motivation

Approach 0: each app writes the lake in its own process. The predictoor bot updates the data lake, then reads from it, in time for making predictions. Same for other apps in pdr-backend.

Approach 1: separate lake, started separately

  • User starts a separate pdr lake process, which constantly writes to the data lake
  • Predictoor bot reads from the lake, but does not write (for safety). Same for other apps.
  • This was the idea when we first conceived of the lake.

But we can do better yet, by leveraging the database concept of "locking", which enables >1 writers without hurting DB safety. Writers must handle contention due to locks, e.g. by waiting.

Approach 2: allow >1 writers.

  • We could have >1 lake processes / threads, predictoor bots, or other apps.
  • Support three flows:
    • Flow 1: quickstart: start pdr lake inside the app. E.g. a user starts one pdr predictoor process (and nothing else). The predictoor bot detects whether a lake process is running, and starts one if needed.
    • Flow 2: power-predictoor usage: start pdr lake separately. E.g. a user starts pdr lake, then 20 pdr predictoor processes, one for each feed to predict.
    • Flow 3: power-lake usage: >1 lake processes / threads filling complementary parts of the lake (different pairs, different subgraph queries). E.g. a user starts 1 pdr lake process, and it starts >1 threads. E.g. a user starts 1 process, then later a different one with different goals. E.g. >1 users start different processes.
  • Benefits: (a) more convenient: users don't need to kick off the lake process themselves; (b) faster: parallel fill; (c) more flexible: users (or predictoor bots) can start more lake processes without worry.

Approach 2 is the endgame. Its benefits over Approach 1 are immense, let alone over Approach 0.

Q: Should we go from 0 -> 1 -> 2, or 0->2 directly?

  • I (Trent) recommend going 0 -> 2 directly, because of the big benefits.
  • Doing 1 in between would force users to change behavior. (And it's extra effort for us overall: much of the code we'd write for 1 would be thrown away for 2.)

TODOs

  • Locking core: Update lake to support "locking" concept. Such that I could run >1 different pdr lake processes against the same feed, and they wouldn't fight with each other
  • Parallel fill: Update lake to run >1 threads within a single pdr lake process, 1 thread per ohlcv pair or subgraph feed
  • Update predictoor bot: detect whether a lake process is running, and start one if needed.
  • Similarly, update xpmt_engine (nee sim_engine) flow
  • Similarly, update analytics apps flows
  • Ensure READMEs are all updated accordingly. predictoor.md and trader.md should teach the user about how to run pdr lake separately (at the end of the README)
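The "Parallel fill" TODO (1 thread per ohlcv pair or subgraph feed) could look roughly like this; `fill_pair` is a stand-in for the real fetch-and-write logic, and all names here are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def fill_pair(pair: str) -> tuple:
    """Stand-in for fetching one pair's ohlcv data and writing its lake slice."""
    return (pair, "filled")

def parallel_fill(pairs: list) -> dict:
    """One worker per pair, per the 'Parallel fill' TODO above.
    Assumes per-pair slices are disjoint, so workers don't contend."""
    with ThreadPoolExecutor(max_workers=len(pairs)) as ex:
        return dict(ex.map(fill_pair, pairs))
```

With the locking core in place, each worker's writes would go through the lock-aware write path, so even overlapping slices stay safe.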
@trentmc trentmc added the Type: Enhancement New feature or request label Jan 20, 2024
@trentmc trentmc changed the title [Lake, UX] Separate the process for pr [Lake, UX] Lake supports >1 writers, incl predictoor bots and >1 pdr lakes Jan 20, 2024
@idiom-bytes
Member

idiom-bytes commented Jan 23, 2024

I agree in general that all services (including different agents/bots) would benefit from having a lake that's just up-to-date. And having a process that's solely-responsible for doing this is the way forward.

I think there might be other approaches, like "swapping tables" or updating a pointer to the latest table, that might be more productive to implement than locking.

What I originally considered was building a base table.py object that would abstract the schema, return the df, point to a file, etc. The basic structure can be found in table_pdr_predictions and table_pdr_subscriptions. Anything that reads from the lake would do so through the Table() interface, not DataFactory(). This way, DataFactories operate on their own, updating the lake, while components/users access it via the interface.
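A rough sketch of that read-side Table() interface (class names, schema, and the parquet file layout are illustrative assumptions, not the real table_pdr_predictions definitions):

```python
from abc import ABC
from pathlib import Path

class Table(ABC):
    """Hypothetical read-only interface to one lake table.
    Readers go through this; only DataFactory writes the lake."""
    table_name: str  # set by each concrete subclass
    schema: dict     # column name -> type, set by each concrete subclass

    def __init__(self, lake_dir: str):
        self.lake_dir = Path(lake_dir)

    def file_path(self) -> Path:
        """Where this table's current data lives."""
        return self.lake_dir / f"{self.table_name}.parquet"

class PdrPredictionsTable(Table):
    table_name = "pdr_predictions"
    schema = {"timestamp": int, "predvalue": float}
```

The "swapping tables" idea then becomes: DataFactory writes a fresh file and atomically repoints `file_path()` (e.g. via rename), so readers never see a half-written table.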

@idiom-bytes
Member

idiom-bytes commented Jun 25, 2024

DuckDB only lets you have one writer process at a time holding the db writer connection. Within that process, you can have multiple threads operating on it. So we should make the duckdb "container/process/vm" as big as possible.

There is now a task (#1107) for making sure that Lake/ETL has an "update process" that sits there indefinitely, looping and updating the lake.

Forward Looking:

  • We can have requests/queries/etc. pushed to a "duckdb writer/service" via a simple API
  • Solving for "Approach 2" / multiple writer processes would entail a clustered db (i.e. ClickHouse)
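The "push requests to a duckdb writer/service via a simple API" idea could be sketched as a single writer thread draining a queue. Here `execute_fn` stands in for the real duckdb connection's execute; the class and method names are hypothetical:

```python
import queue
import threading

class WriterService:
    """One writer thread owns the single db write connection;
    other components push write requests through a queue (the 'simple API')."""

    def __init__(self, execute_fn):
        self._execute = execute_fn  # e.g. a duckdb connection's execute
        self._q = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, sql: str, params: tuple = ()) -> threading.Event:
        """Enqueue one write; returns an Event set once it has been applied."""
        done = threading.Event()
        self._q.put((sql, params, done))
        return done

    def _run(self):
        while True:
            item = self._q.get()
            if item is None:  # close() sentinel
                return
            sql, params, done = item
            self._execute(sql, params)
            done.set()

    def close(self):
        self._q.put(None)
        self._thread.join()
```

This keeps DuckDB's one-writer constraint intact while still letting many bots/apps issue writes concurrently; going beyond one writer process is where a clustered db would come in.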

Projects
None yet
Development

No branches or pull requests

2 participants