Spark 3.5: Rework DeleteFileIndexBenchmark #9165
Thanks @aokolnychyi for the refactor! LGTM with minor comments.
}

private static long generateRowCount() {
  return 100_000L + RANDOM.nextInt(1000);
Minor: Use ThreadLocalRandom.current().nextInt() to ensure thread safety?
Done
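The suggestion above can be sketched as follows. This is a minimal illustration, not the code merged in the PR: the class name `RowCountSketch` is hypothetical, and the constants are taken from the snippet under review.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical holder class, used only for this illustration.
final class RowCountSketch {
  private RowCountSketch() {}

  // Returns a row count in [100_000, 100_999], matching the reviewed snippet.
  // ThreadLocalRandom.current() hands each thread its own generator,
  // so concurrent callers do not contend on a shared seed.
  static long generateRowCount() {
    return 100_000L + ThreadLocalRandom.current().nextInt(1000);
  }
}
```

Note that a shared `java.util.Random` instance is itself safe to use from multiple threads; its Javadoc warns only that concurrent use may cause contention and poor performance, which `ThreadLocalRandom` avoids.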
}

public static String generateFileName() {
  int partitionId = RANDOM.nextInt(100_000);
Do we need a partition ID in the file name? Files will be located in the partition dir anyway.
It mostly means the Spark write partition ID, to mimic real file names.
  int taskId = RANDOM.nextInt(100);
  UUID operationId = UUID.randomUUID();
Curious how we use taskId, operationId, and fileCount in the file name.
Added a comment indicating that this code replicates OutputFileFactory.
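To make the discussion concrete, here is a hedged sketch of how the IDs from the reviewed snippet might be combined into a file name. The `partitionId-taskId-operationId-fileCount` layout, the `.parquet` extension, the `FILE_COUNT` counter, and the class name `FileNameSketch` are all assumptions for illustration; the thread only states that the real code replicates OutputFileFactory.

```java
import java.util.Locale;
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical class, used only for this illustration.
final class FileNameSketch {
  // Hypothetical counter standing in for a per-writer file count.
  private static final AtomicInteger FILE_COUNT = new AtomicInteger(0);

  private FileNameSketch() {}

  static String generateFileName() {
    ThreadLocalRandom random = ThreadLocalRandom.current();
    int partitionId = random.nextInt(100_000); // mimics a Spark write partition ID
    int taskId = random.nextInt(100);
    UUID operationId = UUID.randomUUID();
    int fileCount = FILE_COUNT.incrementAndGet();
    // Assumed layout: zero-padded partition ID, task ID, operation UUID, zero-padded file count.
    return String.format(
        Locale.ROOT, "%05d-%d-%s-%05d.parquet", partitionId, taskId, operationId, fileCount);
  }
}
```

Because the operation ID is a random UUID and the counter only increases, two generated names are effectively never equal, even when the random partition and task IDs collide.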
Thanks for reviewing, @flyrain!
This PR refactors DeleteFileIndexBenchmark:
- Adds FileGenerationUtil to be used in metadata benchmarks.
- Uses FileGenerationUtil to generate files to speed up the initialization phase of the delete file index benchmark. Prior to this change, we were adding a few real files and cloning them. There is no real benefit in doing 25 real inserts per partition; it just takes time.
- Uses FileGenerationUtil to generate random metrics to mimic realistic scenarios. Prior to this change, we cloned the metrics and used identical values for all cloned files. This compressed very well but is not realistic.
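The idea of randomizing per-file metrics instead of cloning them can be sketched roughly as below. This is not the FileGenerationUtil implementation: the `FileStats` holder, the `generate` method, the use of `long` bounds keyed by field ID, and the value ranges are all hypothetical choices made to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical class, used only for this illustration.
final class RandomMetricsSketch {
  private RandomMetricsSketch() {}

  // Hypothetical stand-in for per-file metrics; not Iceberg's own metrics type.
  static final class FileStats {
    final long rowCount;
    final Map<Integer, Long> lowerBounds;
    final Map<Integer, Long> upperBounds;

    FileStats(long rowCount, Map<Integer, Long> lowerBounds, Map<Integer, Long> upperBounds) {
      this.rowCount = rowCount;
      this.lowerBounds = lowerBounds;
      this.upperBounds = upperBounds;
    }
  }

  // Generates distinct random stats for one file with the given number of columns.
  static FileStats generate(int numColumns) {
    ThreadLocalRandom random = ThreadLocalRandom.current();
    long rowCount = 100_000L + random.nextInt(1000);
    Map<Integer, Long> lowerBounds = new HashMap<>();
    Map<Integer, Long> upperBounds = new HashMap<>();
    for (int fieldId = 1; fieldId <= numColumns; fieldId++) {
      long lower = random.nextLong(0L, 1_000_000L);
      long upper = lower + random.nextLong(1L, 1_000_000L); // always above the lower bound
      lowerBounds.put(fieldId, lower);
      upperBounds.put(fieldId, upper);
    }
    return new FileStats(rowCount, lowerBounds, upperBounds);
  }
}
```

Since every call draws fresh values, the resulting metadata does not collapse into long runs of identical bytes the way cloned metrics do, which is the compression effect the description calls unrealistic.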