Avoid unnecessary Iceberg datafile to onedatafile conversions #330

Merged 5 commits on Feb 23, 2024


@ashvina ashvina commented Feb 16, 2024

Fixes #329

This is a performance optimization change and extends the improvements added to DeltaClient to IcebergClient.

The Iceberg client currently creates unnecessary objects when computing the file diff that identifies new and removed files. It first converts all table-format data files of the current snapshot to OneDataFiles, computes the diff on those OneDataFiles, and then converts the resulting OneDataFiles back to table-format data file objects for writing. This round trip is unnecessary: for large tables with thousands of data files in a snapshot, it creates a large number of objects needlessly.

This change optimizes this process by skipping the unnecessary conversions. This optimization does not change the behavior of the translation. This change does not break backward compatibility and is already covered by existing tests.
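
To illustrate the idea, here is a minimal sketch of a diff computed without the round trip, assuming both snapshots are indexed by file path. The `NativeFile` and `CommonFile` classes and the `addedFiles` / `removedFiles` helpers are hypothetical stand-ins invented for this example; they are not the project's classes or the code in this pull request.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class DirectDiffSketch {
  /** Hypothetical stand-in for a table-format data file object. */
  static final class NativeFile {
    final String path;
    NativeFile(String path) { this.path = path; }
  }

  /** Hypothetical stand-in for the format-neutral file representation. */
  static final class CommonFile {
    final String path;
    CommonFile(String path) { this.path = path; }
  }

  /** Files whose path is in the latest snapshot but not in the previous one. */
  static <L, P> Set<L> addedFiles(Map<String, L> latest, Map<String, P> previous) {
    Set<L> added = new HashSet<>();
    for (Map.Entry<String, L> entry : latest.entrySet()) {
      if (!previous.containsKey(entry.getKey())) {
        added.add(entry.getValue());
      }
    }
    return added;
  }

  /** Files whose path is in the previous snapshot but not in the latest one. */
  static <L, P> Set<P> removedFiles(Map<String, L> latest, Map<String, P> previous) {
    Set<P> removed = new HashSet<>();
    for (Map.Entry<String, P> entry : previous.entrySet()) {
      if (!latest.containsKey(entry.getKey())) {
        removed.add(entry.getValue());
      }
    }
    return removed;
  }

  public static void main(String[] args) {
    Map<String, CommonFile> latest = new HashMap<>();
    latest.put("b.parquet", new CommonFile("b.parquet"));
    latest.put("c.parquet", new CommonFile("c.parquet"));

    Map<String, NativeFile> previous = new HashMap<>();
    previous.put("a.parquet", new NativeFile("a.parquet"));
    previous.put("b.parquet", new NativeFile("b.parquet"));

    // Only paths are compared; neither side is converted to the other type.
    for (CommonFile f : addedFiles(latest, previous)) {
      System.out.println("added: " + f.path);
    }
    for (NativeFile f : removedFiles(latest, previous)) {
      System.out.println("removed: " + f.path);
    }
  }
}
```

Because only the path keys are compared, the removed files come back in their original table-format type and need no conversion before being written, and only the added files would still need to be converted.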

@ashvina ashvina linked an issue Feb 16, 2024 that may be closed by this pull request
@ashvina ashvina marked this pull request as draft February 16, 2024 06:47
@ashvina ashvina marked this pull request as ready for review February 16, 2024 19:28

@ksumit ksumit left a comment


Do we have similar code in other converters?

* @param <P> the type of the previous files
* @return the set of files that are added
*/
public static <L, P> Set<L> findNewAndRemovedFiles(

It feels odd that this method returns part of its result as the return value and the other part by mutating one of the inputs. Should we consider mutating latestFiles inside the method as well and returning void instead?

Contributor Author


You make a good point. It would be cleaner to have the method return the values in a consistent pattern. Thanks, I will update the code.
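
One way to get a consistent pattern is to return both halves of the diff in a single result object and leave the inputs untouched. The sketch below is only an illustration under that assumption; `FilesDiff` and `diff` are hypothetical names chosen for this example and may differ from what the updated code actually uses.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class DiffResultSketch {
  /** Hypothetical read-only holder for both halves of the diff. */
  static final class FilesDiff<L, P> {
    private final Set<L> added;
    private final Set<P> removed;

    FilesDiff(Set<L> added, Set<P> removed) {
      this.added = Collections.unmodifiableSet(added);
      this.removed = Collections.unmodifiableSet(removed);
    }

    Set<L> added() { return added; }

    Set<P> removed() { return removed; }
  }

  static <L, P> FilesDiff<L, P> diff(Map<String, L> latestFiles, Map<String, P> previousFiles) {
    Set<L> added = new HashSet<>();
    // Work on a copy so the caller's map is never mutated.
    Map<String, P> remaining = new HashMap<>(previousFiles);
    for (Map.Entry<String, L> entry : latestFiles.entrySet()) {
      if (remaining.containsKey(entry.getKey())) {
        // Present in both snapshots: neither added nor removed.
        remaining.remove(entry.getKey());
      } else {
        added.add(entry.getValue());
      }
    }
    // Whatever is left was in the previous snapshot only.
    return new FilesDiff<>(added, new HashSet<>(remaining.values()));
  }

  public static void main(String[] args) {
    Map<String, Integer> latest = new HashMap<>();
    latest.put("b", 2);
    latest.put("c", 3);
    Map<String, String> previous = new HashMap<>();
    previous.put("a", "A");
    previous.put("b", "B");

    FilesDiff<Integer, String> result = diff(latest, previous);
    System.out.println("added: " + result.added());
    System.out.println("removed: " + result.removed());
    System.out.println("previous still has " + previous.size() + " entries");
  }
}
```

The defensive copy of `previousFiles` costs one extra map per call; a caller that owns the map and does not reuse it could skip the copy, at the price of bringing back the side effect questioned above.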

ashvina commented Feb 21, 2024

> Do we have similar code in other converters?

The Delta client also had this issue; it was resolved previously in #326. The Hudi client does not appear to have this inefficiency.

@ashvina ashvina requested a review from ksumit February 22, 2024 22:41
*/
public static <L, P> DataFilesDiff<L, P> findNewAndRemovedFiles(
Map<String, L> latestFiles, Map<String, P> previousFiles) {
Set<L> newFiles = new HashSet<>();

We are creating new objects here, no? I looked at the usages, and it looks like we could avoid creating these objects.


This may be suitable for the next level of optimization if we need it. Closing for now.

@jcamachor jcamachor self-requested a review February 23, 2024 17:44
@jcamachor

LGTM, +1

@ashvina ashvina merged commit 822b51d into main Feb 23, 2024
1 check passed
@ashvina ashvina deleted the 329-optimize-data-file-conversions-in-target-clients branch February 23, 2024 18:07
@vinishjail97 vinishjail97 mentioned this pull request Aug 16, 2024

Successfully merging this pull request may close these issues.

Optimize data file conversions in Iceberg and Hudi target clients
3 participants