Optimize object creation for new Delta snapshot #326

Merged

@ashvina ashvina commented Feb 6, 2024

Fixes #325

The current code in the DeltaClient generates unnecessary objects when computing the file diff to find new and removed files. The process first converts all Delta Actions of the current delta log's snapshot to OneDataFiles, uses OneDataFiles to compute the diff, and then converts the resulting OneDataFiles collection back to Delta Action objects for writing. There is a round trip from Delta Action to OneDataFiles here. For large tables with thousands of Actions in a snapshot, this results in the creation of a large number of objects unnecessarily.

This change optimizes the process by skipping the unnecessary conversion of Delta actions from the previous snapshot into OneDataFiles and back into Delta actions. The optimization does not change the behavior of the translation.

This change is already covered by the existing tests for Delta conversion.
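
For readers unfamiliar with the code, the idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SnapshotDiffSketch`, `DeltaAction`, `DataFile`, `Diff`, and `diff` are hypothetical stand-ins, and it assumes files are identified by their path. The previous snapshot's actions are indexed by path and reused as-is, so no intermediate file objects are created for them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SnapshotDiffSketch {

  /** Hypothetical stand-in for a Delta action describing a file in the previous snapshot. */
  static final class DeltaAction {
    final String path;
    DeltaAction(String path) { this.path = path; }
  }

  /** Hypothetical stand-in for the format-neutral data file representation. */
  static final class DataFile {
    final String path;
    DataFile(String path) { this.path = path; }
  }

  /** New files stay in neutral form; removed files stay as the original actions. */
  static final class Diff {
    final List<DataFile> filesAdded = new ArrayList<>();
    final List<DeltaAction> actionsRemoved = new ArrayList<>();
  }

  static Diff diff(List<DataFile> latestFiles, List<DeltaAction> previousActions) {
    // Index the previous snapshot's actions by path without converting them.
    Map<String, DeltaAction> previousByPath = new LinkedHashMap<>();
    for (DeltaAction action : previousActions) {
      previousByPath.put(action.path, action);
    }

    Diff result = new Diff();
    for (DataFile file : latestFiles) {
      // remove() returns null when the path was not in the previous snapshot,
      // which means the file is new. Otherwise the file is unchanged.
      if (previousByPath.remove(file.path) == null) {
        result.filesAdded.add(file);
      }
    }
    // Whatever is left exists only in the previous snapshot, so it was removed.
    result.actionsRemoved.addAll(previousByPath.values());
    return result;
  }

  public static void main(String[] args) {
    Diff d = diff(
        Arrays.asList(new DataFile("a"), new DataFile("b"), new DataFile("d")),
        Arrays.asList(new DeltaAction("a"), new DeltaAction("b"), new DeltaAction("c")));
    System.out.println("added: " + d.filesAdded.get(0).path);
    System.out.println("removed: " + d.actionsRemoved.get(0).path);
  }
}
```

In this sketch only the genuinely new files would still need to be converted into Delta actions before writing; the removed entries are already in that form.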

@ashvina ashvina linked an issue Feb 6, 2024 that may be closed by this pull request
Contributor Author

ashvina commented Feb 6, 2024

CC: @vamshigv

deltaDataFileExtractor.iteratorWithoutStats(deltaLog.snapshot(), tableSchema)) {
fileIterator.forEachRemaining(currentDataFiles::add);
OneDataFilesDiff filesDiff =
OneDataFilesDiff.from(
Contributor

Do we want to do something similar for Iceberg and Hudi targets? Is there any way we can make this logic common?

Contributor Author

That's a good point. This logic could be applied to the other clients as well. I will create a new issue to generalize this logic and implement it for the other clients. Merging this PR in the meantime should unblock the team that is waiting for this fix. I will also mark the new issue as a good beginner issue in case a new contributor is interested.

@ashvina ashvina merged commit ac24b29 into main Feb 11, 2024
1 check passed
@ashvina ashvina deleted the 325-optimization-avoid-object-creation-in-diff-computation branch February 11, 2024 20:24
@vinishjail97 vinishjail97 mentioned this pull request Aug 16, 2024
Successfully merging this pull request may close these issues.

[Optimization] Avoid Object creation in diff computation