Avoid unnecessary Iceberg datafile to onedatafile conversions #330
Conversation
Do we have similar code in other converters?
 * @param <P> the type of the previous files
 * @return the set of files that are added
 */
public static <L, P> Set<L> findNewAndRemovedFiles(
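The method body is not shown in the fragment above, but the review comments that follow make the contract clear: added files come back as the return value, while removed files are left behind by mutating `previousFiles`. Below is a hypothetical reconstruction of that mixed contract, not the actual OneTable source; the class name and test data are invented for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FileDiffSketch {
  /**
   * Plausible shape of the contract under review (hypothetical): the added
   * files are returned, while the removed files are whatever remains in
   * previousFiles after matching entries are removed in place.
   */
  public static <L, P> Set<L> findNewAndRemovedFiles(
      Map<String, L> latestFiles, Map<String, P> previousFiles) {
    Set<L> newFiles = new HashSet<>();
    for (Map.Entry<String, L> entry : latestFiles.entrySet()) {
      // A path present in both snapshots is unchanged: drop it from
      // previousFiles so only the removed files remain there.
      P match = previousFiles.remove(entry.getKey());
      if (match == null) {
        newFiles.add(entry.getValue());
      }
    }
    return newFiles; // removed files: whatever is left in previousFiles
  }

  public static void main(String[] args) {
    Map<String, String> latest = new HashMap<>();
    latest.put("a.parquet", "A");
    latest.put("b.parquet", "B");
    Map<String, String> previous = new HashMap<>();
    previous.put("a.parquet", "A");
    previous.put("c.parquet", "C");
    Set<String> added = findNewAndRemovedFiles(latest, previous);
    System.out.println("added=" + added + " removed=" + previous.keySet());
    // prints: added=[B] removed=[c.parquet]
  }
}
```

The split result is what the next comment objects to: a caller has to know that the input map is consumed as a side effect.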
It feels weird that this method returns part of the result in its return value and part of it by mutating one of its inputs. Should we consider mutating latestFiles inside the method as well and returning void instead?
You make a good point. It would be cleaner to have the method return the values in a consistent pattern. Thanks, I will update the code.
The Delta client also had this issue; it was resolved previously in #326. The Hudi client does not appear to have this inefficiency.
 */
public static <L, P> DataFilesDiff<L, P> findNewAndRemovedFiles(
    Map<String, L> latestFiles, Map<String, P> previousFiles) {
  Set<L> newFiles = new HashSet<>();
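The updated signature above returns both halves of the diff in a single `DataFilesDiff<L, P>` value. The fragment does not show that class, so here is a hedged sketch of what such a value object might look like; the field and accessor names are assumptions, not the actual OneTable API.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of a DataFilesDiff-style value object (hypothetical names):
 * added and removed files travel together in one immutable result,
 * so neither input map has to be mutated as a side effect.
 */
public class DataFilesDiff<L, P> {
  private final Set<L> filesAdded;
  private final Set<P> filesRemoved;

  public DataFilesDiff(Set<L> filesAdded, Set<P> filesRemoved) {
    // Defensive copies keep the diff stable even if the caller
    // later mutates the sets it passed in.
    this.filesAdded = Collections.unmodifiableSet(new HashSet<>(filesAdded));
    this.filesRemoved = Collections.unmodifiableSet(new HashSet<>(filesRemoved));
  }

  public Set<L> getFilesAdded() {
    return filesAdded;
  }

  public Set<P> getFilesRemoved() {
    return filesRemoved;
  }
}
```

Returning one object addresses the earlier review comment: the method no longer reports part of its result by mutating an input.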
We are creating new objects here, no? I looked at the usages, and it looks like we could avoid creating these objects.
This may be suitable for a next level of optimization if we need it. Closing for now.
LGTM, +1
Fixes #329
This is a performance optimization that extends the improvements previously made to the DeltaClient to the IcebergClient.
The current Iceberg client code generates unnecessary objects when computing the file diff between snapshots. It first converts every table-format data file of the current snapshot to a OneDataFile, computes the diff on the OneDataFile collections, and then converts the resulting OneDataFiles back to table-format data file objects for writing. This round trip is unnecessary, and for large tables with thousands of data files per snapshot it creates a large number of short-lived objects.
This change skips the unnecessary conversions. It does not alter the behavior of the translation or break backward compatibility, and it is already covered by existing tests.
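The optimization described above can be pictured as computing the diff directly on path-keyed maps of the source format's own file objects, so Iceberg `DataFile` instances never make a round trip through an intermediate representation. The sketch below illustrates the idea only; the class and field names are invented, and plain strings stand in for `org.apache.iceberg.DataFile`.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DirectDiffSketch {
  /** Minimal container for the two halves of the diff (hypothetical names). */
  public static final class Diff<L, P> {
    public final Set<L> added;
    public final Set<P> removed;

    Diff(Set<L> added, Set<P> removed) {
      this.added = added;
      this.removed = removed;
    }
  }

  /**
   * Computes added/removed files directly on the native file objects,
   * keyed by path. Because the comparison is generic over L and P, no
   * conversion to an intermediate file representation is needed.
   */
  public static <L, P> Diff<L, P> findNewAndRemovedFiles(
      Map<String, L> latestFiles, Map<String, P> previousFiles) {
    Set<L> added = new HashSet<>();
    Set<P> removed = new HashSet<>();
    for (Map.Entry<String, L> e : latestFiles.entrySet()) {
      if (!previousFiles.containsKey(e.getKey())) {
        added.add(e.getValue()); // present now, absent before
      }
    }
    for (Map.Entry<String, P> e : previousFiles.entrySet()) {
      if (!latestFiles.containsKey(e.getKey())) {
        removed.add(e.getValue()); // present before, absent now
      }
    }
    return new Diff<>(added, removed);
  }

  public static void main(String[] args) {
    Map<String, String> latest = new HashMap<>();
    latest.put("s3://t/a.parquet", "fileA");
    latest.put("s3://t/b.parquet", "fileB");
    Map<String, String> previous = new HashMap<>();
    previous.put("s3://t/a.parquet", "fileA");
    previous.put("s3://t/c.parquet", "fileC");
    Diff<String, String> diff = findNewAndRemovedFiles(latest, previous);
    System.out.println("added=" + diff.added + " removed=" + diff.removed);
    // prints: added=[fileB] removed=[fileC]
  }
}
```

The diff is computed by path lookups alone, so only the genuinely added and removed files are materialized; unchanged files in a large snapshot cost nothing but a map lookup.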