
Spark 3.5: Rework DeleteFileIndexBenchmark #9165

Merged 2 commits into apache:main on Dec 8, 2023

Conversation

aokolnychyi (Contributor)

This PR refactors DeleteFileIndexBenchmark:

  • Add FileGenerationUtil to be used in metadata benchmarks.
  • Use FileGenerationUtil to generate files, which speeds up the initialization phase of the delete file index benchmark. Prior to this change, we added a few real files and cloned them. There is no real benefit in doing 25 real inserts for each partition; it just takes time.
  • Use FileGenerationUtil to generate random metrics to mimic realistic scenarios. Prior to this change, we cloned the metrics and used identical values for all cloned files. That compressed very well but is not realistic.
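The second and third bullets can be illustrated with a minimal sketch. This is not the actual FileGenerationUtil API; the class, record, and field choices below are assumptions made purely to show the idea of synthesizing per-file metadata with randomized metrics instead of running real inserts and cloning one file's metrics:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch (not Iceberg's FileGenerationUtil): generate synthetic
// file metadata with randomized metrics rather than cloning one real file.
public class GenerateFiles {
  // Simplified stand-in for a data file's metrics (not the Iceberg DataFile API).
  record FileMetrics(long rowCount, long columnSize, long nullCount) {}

  static FileMetrics generateMetrics() {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    long rowCount = 100_000L + rnd.nextInt(1000);      // roughly 100K rows per file
    long columnSize = rowCount * (8 + rnd.nextInt(8)); // randomized bytes per row
    long nullCount = rnd.nextLong(rowCount / 10);      // up to ~10% nulls
    return new FileMetrics(rowCount, columnSize, nullCount);
  }

  public static void main(String[] args) {
    List<FileMetrics> files = new ArrayList<>();
    for (int i = 0; i < 25; i++) {
      // Cheap: pure metadata generation, no real inserts per partition.
      files.add(generateMetrics());
    }
    System.out.println(files.size());
  }
}
```

Because each generated file gets distinct metric values, the resulting metadata no longer compresses unrealistically well, which is the point of the third bullet.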

@flyrain (Contributor) left a comment:


Thanks @aokolnychyi for the refactor! LGTM with minor comments.

}

private static long generateRowCount() {
  return 100_000L + RANDOM.nextInt(1000);
flyrain (Contributor):

Minor: use ThreadLocalRandom.current().nextInt() to ensure thread safety?

aokolnychyi (Contributor, Author):

Done

}

public static String generateFileName() {
  int partitionId = RANDOM.nextInt(100_000);
flyrain (Contributor):

Do we need a partition ID in the file name? Files will be located in the partition dir anyway.

aokolnychyi (Contributor, Author):

It mostly mimics real file names, which include the Spark write partition ID.

Comment on lines 88 to 89
int taskId = RANDOM.nextInt(100);
UUID operationId = UUID.randomUUID();
flyrain (Contributor):

Curious how taskId, operationId, and fileCount are used in the file name.

aokolnychyi (Contributor, Author):

Added a comment indicating that this code replicates OutputFileFactory.
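A hedged sketch of what such a file name might look like. The exact format string and the .parquet suffix are assumptions for illustration; they mimic the general shape of names produced by Iceberg's OutputFileFactory (partition ID, task ID, operation UUID, per-task file count) rather than reproducing its actual code:

```java
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch, not Iceberg's OutputFileFactory: builds a
// Spark-like data file name from randomized write identifiers.
public class FileNames {
  static String generateFileName() {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    int partitionId = rnd.nextInt(100_000); // Spark write partition ID
    int taskId = rnd.nextInt(100);          // task within the write
    UUID operationId = UUID.randomUUID();   // unique ID for the write operation
    int fileCount = rnd.nextInt(1000);      // nth file emitted by this task
    // Assumed format; the real factory's format may differ.
    return String.format("%05d-%d-%s-%05d.parquet",
        partitionId, taskId, operationId, fileCount);
  }

  public static void main(String[] args) {
    System.out.println(generateFileName());
  }
}
```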

@aokolnychyi aokolnychyi merged commit feeaa8c into apache:main Dec 8, 2023
45 checks passed
aokolnychyi (Contributor, Author):

Thanks for reviewing, @flyrain!

lisirrx pushed a commit to lisirrx/iceberg that referenced this pull request Jan 4, 2024
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024