
[Don't Merge] Setting to control delta job count for each delta write #2031

Closed · wants to merge 3 commits

Conversation

@sh-rp (Collaborator) commented Nov 6, 2024

Description

NOTE: we'll keep this here in case other users hit this problem before the Rust implementation is fixed.

See the linked ticket. If we write all jobs at once, the Rust implementation apparently loads all jobs into memory, so loading large tables ends up creating a large memory footprint.

We may also want a setting that controls the maximum combined size of all jobs in one write operation. For that we would have to read the file size of each job from the filesystem, but that is doable. I am not sure.
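A rough sketch of what such size-based grouping could look like; `chunk_jobs_by_size` and `max_bytes` are hypothetical names and not part of dlt:

```python
import os
from typing import Iterable, Iterator, List


def chunk_jobs_by_size(job_paths: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Hypothetical helper: group job files so each delta write stays under max_bytes."""
    chunk: List[str] = []
    chunk_size = 0
    for path in job_paths:
        size = os.path.getsize(path)  # read the job's file size from the filesystem
        if chunk and chunk_size + size > max_bytes:
            yield chunk
            chunk, chunk_size = [], 0
        chunk.append(path)
        chunk_size += size
    if chunk:
        yield chunk
```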

ToDo:

  • Maybe write better tests
  • If this setting is not None, we should force loader_parallelism_strategy to table-sequential so the write jobs don't lock each other.

@sh-rp sh-rp linked an issue Nov 6, 2024 that may be closed by this pull request

netlify bot commented Nov 6, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: 62d0055
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/672b957ada6aab00080cd409

@rudolfix (Collaborator) left a comment

This is not solving the root problem of a large file being loaded at once.

AFAIK delta-rs accepts Arrow batches, and it certainly accepts data frames.

IMO

  • keep the original code with a dataset
  • add a setting, i.e. a batch size, to split the Arrow dataset into batches; we should possibly use multiples of row groups, as when we batch in the normalizer
  • feed the batches into delta

This way we conserve memory. TBH we were sure that by using a dataset we would allow the library to do efficient batching internally; it seems the opposite is true.
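For illustration, a minimal sketch of the batching idea, assuming the `deltalake` Python package's `write_deltalake` accepts a `pyarrow.RecordBatchReader`; the paths and batch size are made up:

```python
import pyarrow as pa
import pyarrow.dataset as ds
from deltalake import write_deltalake

# Scan the parquet job files as a dataset and stream them in bounded batches
# instead of materializing everything in memory at once.
dataset = ds.dataset("load_package/new_jobs/", format="parquet")  # illustrative path
batches = dataset.to_batches(batch_size=64 * 1024)  # rows per batch, illustrative

# Wrapping the batches in a RecordBatchReader lets delta-rs consume a single
# stream (one write) rather than one write per batch.
reader = pa.RecordBatchReader.from_batches(dataset.schema, batches)
write_deltalake("path/to/delta_table", reader, mode="append")
```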

@sh-rp (Collaborator, Author) commented Nov 6, 2024

@rudolfix the underlying issue in the Rust implementation is being worked on here: delta-io/delta-rs#2289. I'm not sure we should fix this ourselves. My worry with feeding in batches is that if the job aborts for transient reasons, the whole reference job will be marked as unloaded, all rows of all files are loaded again on the next try, and you can end up with duplicate entries. If we split this up into multiple jobs, this will not happen, or if it does, the number of duplicate rows is smaller.

@jorritsandbrink (Collaborator) commented

Regarding the suggested solution:

  • replace write disposition: will it work?
  • upsert strategy: will it catch duplicate primary_key values if they're in different jobs?
  • might be slower

My take:

  • current code does the "right" thing
  • the delta-rs bug is part of their Rust v1.0.0 Milestone that is due by November 20, so hopefully will be solved soon (one of their main committers self-assigned the issue a couple of days ago)
  • we should not complicate our code base to work around the bug in the meantime

As a workaround for now, could users prevent memory issues in the load step by yielding smaller data sets during extract?
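For illustration, a rough sketch of that workaround with a resource that yields one parquet row group at a time instead of a whole file; the resource name, file path, and keys are made up:

```python
import dlt
import pyarrow.parquet as pq


@dlt.resource(name="events", write_disposition="merge", primary_key="id")
def events(path: str = "events.parquet"):
    # Yield one row group at a time so each extracted item (and thus each
    # load job handed to the delta writer) stays small.
    parquet_file = pq.ParquetFile(path)
    for i in range(parquet_file.num_row_groups):
        yield parquet_file.read_row_group(i)
```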

@sh-rp (Collaborator, Author) commented Nov 7, 2024

@Gilbert09 is loading the table in multiple loads an option until the Rust implementation is fixed? I tend to agree with @jorritsandbrink that the Rust implementation should behave properly. @jorritsandbrink FYI: @Gilbert09 is loading large tables with the merge write_disposition.

@Gilbert09 commented

> replace write disposition: will it work?

No, this doesn't work right now; I'm having to force users to use merge for these large tables until the Rust implementation is fixed.

> might be slower

It's certainly slower: there's a non-trivial overhead between each job coming from somewhere (likely delta-rs overhead).

In theory, this doesn't actually need to be merged. I've solved half the issue by monkey-patching our dlt implementation (see PostHog/posthog#26040). The real solution here is the Rust implementation being fixed, which looks like it has some progress now.

@sh-rp (Collaborator, Author) commented Nov 7, 2024

@Gilbert09 alright, then I think what I'll do is add a warning to the docs page for now and not merge this.

@sh-rp sh-rp changed the title [WIP] Setting to control delta job count for each delta write [Don't Merge] Setting to control delta job count for each delta write Nov 7, 2024
@rudolfix (Collaborator) commented Nov 7, 2024

> @rudolfix the underlying issue in the Rust implementation is being worked on here: delta-io/delta-rs#2289. I'm not sure we should fix this ourselves. My worry with feeding in batches is that if the job aborts for transient reasons, the whole reference job will be marked as unloaded, all rows of all files are loaded again on the next try, and you can end up with duplicate entries. If we split this up into multiple jobs, this will not happen, or if it does, the number of duplicate rows is smaller.

Right! We'd need to use transactions then, possibly accumulating all data in memory again... Waiting for the delta-rs fix is probably the best way. Should we close this PR?

@sh-rp (Collaborator, Author) commented Nov 7, 2024

Closing as won't fix (or rather, this will hopefully be fixed by delta-rs).

@sh-rp sh-rp closed this Nov 7, 2024
Merging this pull request may close: Improve delta table memory footprint