
Spark 3.5: Rework DeleteFileIndexBenchmark #9165

Merged 2 commits into apache:main on Dec 8, 2023

Conversation

aokolnychyi (Contributor)

This PR refactors DeleteFileIndexBenchmark:

  • Add FileGenerationUtil to be used in metadata benchmarks.
  • Use FileGenerationUtil to generate files, which speeds up the initialization phase of the delete file index benchmark. Prior to this change, we added a few real files and cloned them. There is no real benefit in doing 25 real inserts for each partition; it just takes time.
  • Use FileGenerationUtil to generate random metrics to mimic realistic scenarios. Prior to this change, we cloned the metrics and used identical values for all cloned files. That compressed very well but is not realistic.
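The second and third bullets can be illustrated with a minimal sketch. This is not the actual FileGenerationUtil API; the class, record, and field choices below are assumptions made purely to show the idea of synthesizing per-file metadata with randomized metrics instead of running real inserts and cloning one file's metrics:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch (not Iceberg's FileGenerationUtil): generate synthetic
// file metadata with randomized metrics rather than cloning one real file.
public class GenerateFiles {
  // Simplified stand-in for a data file's metrics (not the Iceberg DataFile API).
  record FileMetrics(long rowCount, long columnSize, long nullCount) {}

  static FileMetrics generateMetrics() {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    long rowCount = 100_000L + rnd.nextInt(1000);      // roughly 100K rows per file
    long columnSize = rowCount * (8 + rnd.nextInt(8)); // randomized bytes per row
    long nullCount = rnd.nextLong(rowCount / 10);      // up to ~10% nulls
    return new FileMetrics(rowCount, columnSize, nullCount);
  }

  public static void main(String[] args) {
    List<FileMetrics> files = new ArrayList<>();
    for (int i = 0; i < 25; i++) {
      // Cheap: pure metadata generation, no real inserts per partition.
      files.add(generateMetrics());
    }
    System.out.println(files.size());
  }
}
```

Because each generated file gets distinct metric values, the resulting metadata no longer compresses unrealistically well, which is the point of the third bullet.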

@flyrain (Contributor) left a comment:


Thanks @aokolnychyi for the refactor! LGTM with minor comments.

}

private static long generateRowCount() {
  return 100_000L + RANDOM.nextInt(1000);
flyrain (Contributor):

Minor: use ThreadLocalRandom.current().nextInt() to ensure thread safety?

aokolnychyi (Contributor, Author):

Done

}

public static String generateFileName() {
  int partitionId = RANDOM.nextInt(100_000);
flyrain (Contributor):

Do we need a partition ID in the file name? Files will be located in the partition dir anyway.

aokolnychyi (Contributor, Author):

It mostly mimics real file names, which include the Spark write partition ID.

Comment on lines 88 to 89
int taskId = RANDOM.nextInt(100);
UUID operationId = UUID.randomUUID();
flyrain (Contributor):

Curious how taskId, operationId, and fileCount are used in the file name.

aokolnychyi (Contributor, Author):

Added a comment indicating that this code replicates OutputFileFactory.
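A hedged sketch of what such a file name might look like. The exact format string and the .parquet suffix are assumptions for illustration; they mimic the general shape of names produced by Iceberg's OutputFileFactory (partition ID, task ID, operation UUID, per-task file count) rather than reproducing its actual code:

```java
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch, not Iceberg's OutputFileFactory: builds a
// Spark-like data file name from randomized write identifiers.
public class FileNames {
  static String generateFileName() {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    int partitionId = rnd.nextInt(100_000); // Spark write partition ID
    int taskId = rnd.nextInt(100);          // task within the write
    UUID operationId = UUID.randomUUID();   // unique ID for the write operation
    int fileCount = rnd.nextInt(1000);      // nth file emitted by this task
    // Assumed format; the real factory's format may differ.
    return String.format("%05d-%d-%s-%05d.parquet",
        partitionId, taskId, operationId, fileCount);
  }

  public static void main(String[] args) {
    System.out.println(generateFileName());
  }
}
```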

@aokolnychyi aokolnychyi merged commit feeaa8c into apache:main Dec 8, 2023
45 checks passed
aokolnychyi (Contributor, Author):

Thanks for reviewing, @flyrain!

lisirrx pushed a commit to lisirrx/iceberg that referenced this pull request Jan 4, 2024
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024