Avoid unnecessary Iceberg datafile to onedatafile conversions #330
Conversation
Do we have similar code in other converters?
 * @param <P> the type of the previous files
 * @return the set of files that are added
 */
public static <L, P> Set<L> findNewAndRemovedFiles(
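The method body is not shown in the fragment above, but the review comments that follow make the contract clear: added files come back as the return value, while removed files are left behind by mutating `previousFiles`. Below is a hypothetical reconstruction of that mixed contract, not the actual OneTable source; the class name and test data are invented for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FileDiffSketch {
  /**
   * Plausible shape of the contract under review (hypothetical): the added
   * files are returned, while the removed files are whatever remains in
   * previousFiles after matching entries are removed in place.
   */
  public static <L, P> Set<L> findNewAndRemovedFiles(
      Map<String, L> latestFiles, Map<String, P> previousFiles) {
    Set<L> newFiles = new HashSet<>();
    for (Map.Entry<String, L> entry : latestFiles.entrySet()) {
      // A path present in both snapshots is unchanged: drop it from
      // previousFiles so only the removed files remain there.
      P match = previousFiles.remove(entry.getKey());
      if (match == null) {
        newFiles.add(entry.getValue());
      }
    }
    return newFiles; // removed files: whatever is left in previousFiles
  }

  public static void main(String[] args) {
    Map<String, String> latest = new HashMap<>();
    latest.put("a.parquet", "A");
    latest.put("b.parquet", "B");
    Map<String, String> previous = new HashMap<>();
    previous.put("a.parquet", "A");
    previous.put("c.parquet", "C");
    Set<String> added = findNewAndRemovedFiles(latest, previous);
    System.out.println("added=" + added + " removed=" + previous.keySet());
    // prints: added=[B] removed=[c.parquet]
  }
}
```

The split result is what the next comment objects to: a caller has to know that the input map is consumed as a side effect.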
It feels weird that this method returns part of the result in its return value and part of it by mutating one of its inputs. Should we consider mutating latestFiles inside the method as well and returning void instead?
You make a good point. It would be cleaner to have the method return the values in a consistent pattern. Thanks, I will update the code.
The Delta client also had this issue; it was resolved previously in #326. The Hudi client does not appear to have this inefficiency.
 */
public static <L, P> DataFilesDiff<L, P> findNewAndRemovedFiles(
    Map<String, L> latestFiles, Map<String, P> previousFiles) {
  Set<L> newFiles = new HashSet<>();
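The updated signature above returns both halves of the diff in a single `DataFilesDiff<L, P>` value. The fragment does not show that class, so here is a hedged sketch of what such a value object might look like; the field and accessor names are assumptions, not the actual OneTable API.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of a DataFilesDiff-style value object (hypothetical names):
 * added and removed files travel together in one immutable result,
 * so neither input map has to be mutated as a side effect.
 */
public class DataFilesDiff<L, P> {
  private final Set<L> filesAdded;
  private final Set<P> filesRemoved;

  public DataFilesDiff(Set<L> filesAdded, Set<P> filesRemoved) {
    // Defensive copies keep the diff stable even if the caller
    // later mutates the sets it passed in.
    this.filesAdded = Collections.unmodifiableSet(new HashSet<>(filesAdded));
    this.filesRemoved = Collections.unmodifiableSet(new HashSet<>(filesRemoved));
  }

  public Set<L> getFilesAdded() {
    return filesAdded;
  }

  public Set<P> getFilesRemoved() {
    return filesRemoved;
  }
}
```

Returning one object addresses the earlier review comment: the method no longer reports part of its result by mutating an input.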
We are creating new objects here, no? I looked at the usages, and it looks like we could avoid creating these objects.
This may be suitable for a next level of optimization if we need it. Closing for now.
LGTM, +1
Fixes #329
This is a performance optimization that extends the improvements previously made to the DeltaClient to the IcebergClient.
The current Iceberg client code generates unnecessary objects when computing the file diff between snapshots. It first converts every table-format data file of the current snapshot to a OneDataFile, computes the diff on the OneDataFile collections, and then converts the resulting OneDataFiles back to table-format data file objects for writing. This round trip is unnecessary, and for large tables with thousands of data files per snapshot it creates a large number of short-lived objects.
This change skips the unnecessary conversions. It does not alter the behavior of the translation or break backward compatibility, and it is already covered by existing tests.
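The optimization described above can be pictured as computing the diff directly on path-keyed maps of the source format's own file objects, so Iceberg `DataFile` instances never make a round trip through an intermediate representation. The sketch below illustrates the idea only; the class and field names are invented, and plain strings stand in for `org.apache.iceberg.DataFile`.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DirectDiffSketch {
  /** Minimal container for the two halves of the diff (hypothetical names). */
  public static final class Diff<L, P> {
    public final Set<L> added;
    public final Set<P> removed;

    Diff(Set<L> added, Set<P> removed) {
      this.added = added;
      this.removed = removed;
    }
  }

  /**
   * Computes added/removed files directly on the native file objects,
   * keyed by path. Because the comparison is generic over L and P, no
   * conversion to an intermediate file representation is needed.
   */
  public static <L, P> Diff<L, P> findNewAndRemovedFiles(
      Map<String, L> latestFiles, Map<String, P> previousFiles) {
    Set<L> added = new HashSet<>();
    Set<P> removed = new HashSet<>();
    for (Map.Entry<String, L> e : latestFiles.entrySet()) {
      if (!previousFiles.containsKey(e.getKey())) {
        added.add(e.getValue()); // present now, absent before
      }
    }
    for (Map.Entry<String, P> e : previousFiles.entrySet()) {
      if (!latestFiles.containsKey(e.getKey())) {
        removed.add(e.getValue()); // present before, absent now
      }
    }
    return new Diff<>(added, removed);
  }

  public static void main(String[] args) {
    Map<String, String> latest = new HashMap<>();
    latest.put("s3://t/a.parquet", "fileA");
    latest.put("s3://t/b.parquet", "fileB");
    Map<String, String> previous = new HashMap<>();
    previous.put("s3://t/a.parquet", "fileA");
    previous.put("s3://t/c.parquet", "fileC");
    Diff<String, String> diff = findNewAndRemovedFiles(latest, previous);
    System.out.println("added=" + diff.added + " removed=" + diff.removed);
    // prints: added=[fileB] removed=[fileC]
  }
}
```

The diff is computed by path lookups alone, so only the genuinely added and removed files are materialized; unchanged files in a large snapshot cost nothing but a map lookup.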