
[Don't Merge] Setting to control delta job count for each delta write #2031

Closed · wants to merge 3 commits

Conversation

@sh-rp (Collaborator) commented Nov 6, 2024

Description

NOTE: we'll keep this here in case other users hit this problem before the Rust implementation is fixed.

See the linked ticket. If we write all jobs at once, the Rust implementation apparently loads all jobs into memory, so loading large tables ends up creating a large memory footprint.

We may also want a setting that controls the maximum combined size of all jobs in one write operation. For that we would have to read the file size of each job from the filesystem, but that is doable. I am not sure.
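A rough sketch of what such size-based grouping could look like; `chunk_jobs_by_size` and `max_bytes` are hypothetical names and not part of dlt:

```python
import os
from typing import Iterable, Iterator, List


def chunk_jobs_by_size(job_paths: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Hypothetical helper: group job files so each delta write stays under max_bytes."""
    chunk: List[str] = []
    chunk_size = 0
    for path in job_paths:
        size = os.path.getsize(path)  # read the job's file size from the filesystem
        if chunk and chunk_size + size > max_bytes:
            yield chunk
            chunk, chunk_size = [], 0
        chunk.append(path)
        chunk_size += size
    if chunk:
        yield chunk
```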

ToDo:

  • Maybe write better tests
  • If this setting is not None, we should force loader_parallelism_strategy to table-sequential so the write jobs don't lock each other.

@sh-rp sh-rp linked an issue Nov 6, 2024 that may be closed by this pull request

netlify bot commented Nov 6, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: 62d0055
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/672b957ada6aab00080cd409

@rudolfix (Collaborator) left a comment

This is not solving the root problem of a large file being loaded at once.

AFAIK delta-rs accepts Arrow batches, and it certainly accepts data frames.

IMO

  • keep the original code with a dataset
  • add a setting, i.e. a batch size, to split the Arrow dataset into batches; we should possibly use multiples of row groups, as when we batch in the normalizer
  • feed the batches into delta

This way we conserve memory. TBH we were sure that by using a dataset we would allow the library to do efficient batching internally; it seems the opposite is true.
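For illustration, a minimal sketch of the batching idea, assuming the `deltalake` Python package's `write_deltalake` accepts a `pyarrow.RecordBatchReader`; the paths and batch size are made up:

```python
import pyarrow as pa
import pyarrow.dataset as ds
from deltalake import write_deltalake

# Scan the parquet job files as a dataset and stream them in bounded batches
# instead of materializing everything in memory at once.
dataset = ds.dataset("load_package/new_jobs/", format="parquet")  # illustrative path
batches = dataset.to_batches(batch_size=64 * 1024)  # rows per batch, illustrative

# Wrapping the batches in a RecordBatchReader lets delta-rs consume a single
# stream (one write) rather than one write per batch.
reader = pa.RecordBatchReader.from_batches(dataset.schema, batches)
write_deltalake("path/to/delta_table", reader, mode="append")
```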

@sh-rp (Collaborator, Author) commented Nov 6, 2024

@rudolfix the underlying issue in the Rust implementation is being worked on here: delta-io/delta-rs#2289. I'm not sure we should fix this ourselves. My worry with feeding in batches is that if the job aborts for transient reasons, the whole reference job will be marked as unloaded, all rows of all files are loaded again on the next try, and you can end up with duplicate entries. If we split this up into multiple jobs, this will not happen, or if it does, the number of duplicate rows is smaller.

@jorritsandbrink (Collaborator) commented

Regarding the suggested solution:

  • replace write disposition: will it work?
  • upsert strategy: will it catch duplicate primary_key values if they're in different jobs?
  • might be slower

My take:

  • current code does the "right" thing
  • the delta-rs bug is part of their Rust v1.0.0 Milestone that is due by November 20, so hopefully will be solved soon (one of their main committers self-assigned the issue a couple of days ago)
  • we should not complicate our code base to work around the bug in the meantime

As a workaround for now, could users prevent memory issues in the load step by yielding smaller data sets during extract?
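For illustration, a rough sketch of that workaround with a resource that yields one parquet row group at a time instead of a whole file; the resource name, file path, and keys are made up:

```python
import dlt
import pyarrow.parquet as pq


@dlt.resource(name="events", write_disposition="merge", primary_key="id")
def events(path: str = "events.parquet"):
    # Yield one row group at a time so each extracted item (and thus each
    # load job handed to the delta writer) stays small.
    parquet_file = pq.ParquetFile(path)
    for i in range(parquet_file.num_row_groups):
        yield parquet_file.read_row_group(i)
```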

@sh-rp (Collaborator, Author) commented Nov 7, 2024

@Gilbert09 is loading the table in multiple loads an option until the Rust implementation is fixed? I tend to agree with @jorritsandbrink that the Rust implementation should behave properly. @jorritsandbrink FYI: @Gilbert09 is loading large tables with the merge write_disposition.

@Gilbert09 commented

> replace write disposition: will it work?

No, this doesn't work right now; I'm having to force users to use merge for these large tables until the Rust implementation is fixed.

> might be slower

It's certainly slower: there's a non-trivial overhead between each job coming from somewhere (likely delta-rs overhead).

In theory, this doesn't actually need to be merged. I've solved half the issue by monkey-patching our dlt implementation (see PostHog/posthog#26040). The real solution here is the Rust implementation being fixed, which looks like it has some progress now.

@sh-rp (Collaborator, Author) commented Nov 7, 2024

@Gilbert09 alright, then I think what I'll do is add a warning to the docs page for now and not merge this.

@sh-rp sh-rp changed the title [WIP] Setting to control delta job count for each delta write [Don't Merge] Setting to control delta job count for each delta write Nov 7, 2024
@rudolfix (Collaborator) commented Nov 7, 2024

> @rudolfix the underlying issue in the Rust implementation is being worked on here: delta-io/delta-rs#2289. I'm not sure we should fix this ourselves. My worry with feeding in batches is that if the job aborts for transient reasons, the whole reference job will be marked as unloaded, all rows of all files are loaded again on the next try, and you can end up with duplicate entries. If we split this up into multiple jobs, this will not happen, or if it does, the number of duplicate rows is smaller.

Right! We'd need to use transactions then, possibly accumulating all data in memory again... Waiting for the delta-rs fix is probably the best way. Should we close this PR?

@sh-rp (Collaborator, Author) commented Nov 7, 2024

Closing as won't fix (or rather, this will hopefully be fixed by delta-rs).

@sh-rp sh-rp closed this Nov 7, 2024
Merging this pull request may close: Improve delta table memory footprint