[Datasets] Add initial aggregate benchmark #28486
Is the input really CSV instead of Parquet? It seems like a lot of time will be spent just decoding the CSV.
@ericl - yes, the input is a CSV file; see the script to generate the input file, and the Spark script to run the benchmark.

That's true, it will be significantly slower than Parquet. But the input is loaded only once and reused across benchmark runs, and the read time is not counted in the benchmark runtime, the same way h2oai db-benchmark measures other systems (e.g. the Spark script above). Right now reading 500MB takes less than 10 seconds, and 5GB takes less than 1 minute.
I'm assuming that we do an explicit repartition step instead of setting `parallelism=num_blocks` at read time since we're not guaranteed that `parallelism` will be respected, e.g. if `parallelism > num_files`?
Yes, there's only one file in this case, so we have to do `repartition`.
It may be a good idea to look at porting this to a custom Polars aggregation once that integration is merged.
Yeah, we also plan to enable Polars later.
What's the scope of the Benchmark? IIUC it's a benchmark for Dataset transformations; if so, maybe make that more clear. It would also be good to mention whether it's applicable to both local and distributed benchmarking.
Ideally the scope of `Benchmark` should cover all data-related benchmarks (dataset, dataset pipeline, transform, action, etc.); it's not restricted to dataset transformations only. It works for both local and distributed benchmarking. Let me add more documentation.

@jianoaix - updated.
If `fn` is a Dataset-to-Dataset mapping, it's basically a transform? Operations like `iter_batches()`, `min`/`max`, etc. are not covered.
Oh, it just makes it easy to retrieve the statistics by returning another Dataset. You can run arbitrary logic inside the benchmarked function:
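The inline example from this comment did not survive, so here is a minimal sketch of the pattern being described: a hypothetical `Benchmark` harness (the class below is illustrative, not Ray's actual implementation) whose `run` method times a callable that can contain arbitrary logic, as long as its return value exposes the statistics to record.

```python
import time


class Benchmark:
    """Illustrative sketch of a benchmark harness; not Ray's actual class."""

    def __init__(self, name):
        self.name = name
        self.results = {}  # case name -> wall-clock duration in seconds

    def run(self, case_name, fn, *args, **kwargs):
        # Time a single benchmark case. `fn` may contain arbitrary logic;
        # it is only expected to return an object (e.g. a Dataset) from
        # which statistics can later be retrieved.
        start = time.perf_counter()
        output = fn(*args, **kwargs)
        self.results[case_name] = time.perf_counter() - start
        return output


# Usage: arbitrary logic inside the benchmarked function. `aggregate_case`
# is a stand-in for something like ds.groupby(...).sum() in the real test.
def aggregate_case(rows):
    return sum(rows)


bench = Benchmark("aggregate")
total = bench.run("sum-1k", aggregate_case, range(1000))
```

The key design point from the discussion is that `run` only times the call and stores the duration; what `fn` does internally is entirely up to the caller.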
Also, just to add: the parameters to `fn` can be anything, so we are not bound to passing a Dataset.
I'd add a comment about what `fn` is expected to return.
@jianoaix - sure, added. Also, `run(fn: Callable[..., Dataset])` carries the function's expected return type in its signature.
We may want to run the benchmark multiple times to reduce noise. It's easy to add a for-loop for this later. Right now the aggregate benchmark does not have enough noise to be worth rerunning.
Sounds good. This requires `fn` to be stateless/side-effect free.
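As a sketch of the for-loop idea discussed above (the `run_repeated` helper below is hypothetical, not part of the actual benchmark utility): run a stateless `fn` several times and keep the best time, which is only meaningful because a side-effect-free `fn` does the same work on every trial.

```python
import time


def run_repeated(fn, num_trials=3):
    """Run a stateless benchmark fn several times; keep the best time.

    Assumes fn is side-effect free, so every trial measures identical
    work and the minimum duration is a reasonable noise-reduced metric.
    """
    best = float("inf")
    result = None
    for _ in range(num_trials):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return result, best


# Usage with a trivially stateless function.
value, elapsed = run_repeated(lambda: sum(range(10_000)), num_trials=3)
```

Taking the minimum (rather than the mean) is a common choice here because external noise can only slow a trial down, never speed it up.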