[Datasets] Add initial aggregate benchmark #28486

Merged Sep 22, 2022 (7 commits)

Conversation

@c21 (Contributor) commented Sep 13, 2022

Signed-off-by: Cheng Su [email protected]

Why are these changes needed?

This PR adds an initial aggregate benchmark (for the h2oai benchmark - https://github.com/h2oai/db-benchmark). To follow the convention of the h2oai benchmark, the benchmark is run on a single node (https://h2oai.github.io/db-benchmark/#environment-configuration). There is no fundamental blocker to running the benchmark on multiple nodes (it is just a matter of changing our yaml file). The benchmark has three input file sizes - 0.5GB, 5GB and 50GB. Here we start with the 0.5GB input file. A follow-up PR will add the benchmark for 5GB and 50GB (just a matter of generating the input files; no benchmark code changes needed).

NOTE: The benchmark queries are not optimized yet; this is just the most straightforward version of the code. We can use it as a baseline to find the performance gap and optimize.
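
For illustration, the most straightforward translation of one of the h2oai queries (q1, "sum v1 by id1") to the Datasets API might look like the sketch below; the id1/v1 column names come from the h2oai G1 input data, and the exact code in this PR may differ:

def h2oai_q1(input_ds):
    # h2oai q1: group by a single key column and sum one value column.
    return input_ds.groupby("id1").sum("v1")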

A typical benchmark workflow would be:

  1. Create an xxx_benchmark.py file for the specific APIs to benchmark (e.g. split_benchmark.py for split-related APIs).
  2. Use the Benchmark class to run the benchmark (see the sketch after this list).
  3. Check in the benchmark code after testing it locally and on a workspace.
  4. Monitor the nightly test results.
  5. Create a Preset/Databricks dashboard and alerts on the benchmark results.
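
As a rough illustration of steps 1-2, a minimal sketch of a new benchmark file follows. The names here are hypothetical; Benchmark.run taking a case name, a callable, and keyword arguments follows the shape used in this PR, while the Benchmark("...") constructor argument and a report()-style step that persists results (e.g. to /tmp/result.json) are assumptions:

import ray

from benchmark import Benchmark  # helper class added in this PR


def run_split(input_ds, n):
    # Arbitrary Dataset logic; return the Dataset whose stats we care about.
    return input_ds.split(n)[0]


benchmark = Benchmark("split")
input_ds = ray.data.range(1_000_000).fully_executed()
benchmark.run("split-2", run_split, input_ds=input_ds, n=2)
benchmark.report()  # assumed to write per-case timings, e.g. to /tmp/result.json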

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Tested on a workspace with the same cluster environment as aggregate_benchmark_compute.yaml.
Verified the benchmark succeeds - https://console.anyscale-staging.com/o/anyscale-internal/workspaces/expwrk_SkskBJ2Um8GMzaDAL4Zn8nvb/ses_LCGGwWMb1a14Zp5hqq6mTXvQ

(base) ray:~/default% python release/nightly_tests/dataset/aggregate_benchmark.py 
Running benchmark: aggregate
Running case: h2oai-500M-q1
Result of case h2oai-500M-q1: {'time': 13.886677083000002}
Running case: h2oai-500M-q3
Result of case h2oai-500M-q3: {'time': 23.728328249999997}
Running case: h2oai-500M-q4
Result of case h2oai-500M-q4: {'time': 13.20559575}
Running case: h2oai-500M-q5
Result of case h2oai-500M-q5: {'time': 25.303984000000007}
Running case: h2oai-500M-q7
Result of case h2oai-500M-q7: {'time': 25.987803}
Running case: h2oai-500M-q8
Result of case h2oai-500M-q8: {'time': 19.922149207999993}
Finish benchmark: aggregate
(base) ray:~/default% cat /tmp/result.json
{"h2oai-500M-q1": {"time": 13.886677083000002}, "h2oai-500M-q3": {"time": 23.728328249999997}, "h2oai-500M-q4": {"time": 13.20559575}, "h2oai-500M-q5": {"time": 25.303984000000007}, "h2oai-500M-q7": {"time": 25.987803}, "h2oai-500M-q8": {"time": 19.922149207999993}}%


print(f"Running case: {name}")
start_time = time.perf_counter()
output_ds = fn(**fn_run_args)
@c21 (author), Sep 13, 2022:

We may want to run the benchmark multiple times to reduce noise. It's easy to add a for-loop around this later (see the sketch below). Right now the aggregate benchmark does not have enough noise to be worth rerunning.
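
A minimal sketch of what that could look like (num_repeats is hypothetical and not part of this PR; it also assumes fn has no side effects, per the reply below):

durations = []
for _ in range(num_repeats):
    start_time = time.perf_counter()
    output_ds = fn(**fn_run_args)
    durations.append(time.perf_counter() - start_time)
# Report the mean (or median) across repeats to smooth out run-to-run noise.
duration = sum(durations) / len(durations)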

Contributor:

Sounds good. This needs fn to be stateless/side-effect free.

The input files are pre-generated and stored in AWS S3 beforehand.
"""
test_input = [
("s3://air-example-data/h2oai_benchmark/G1_1e7_1e2_0_0.csv", "h2oai-500M")
Contributor:

Is the input really CSV instead of parquet? That seems like it will spend a lot of time decoding just the CSV.

@c21 (author):

@ericl - yes, the input is a CSV file - there is a script to generate the input file, and a Spark script to run the benchmark.

That seems like it will spend a lot of time decoding just the CSV.

That's true, it will be significantly slower than Parquet. But it's only loaded once and reused across benchmark runs. And the read time is not counted in the benchmark runtime, the same as how h2oai db-benchmark measures other systems (e.g. the Spark script above). Right now reading 500MB takes less than 10 seconds, and 5GB takes less than 1 minute.

@jianoaix (Contributor) left a comment:

Thanks for taking a stab at benchmarking!

from ray.data.dataset import Dataset


class Benchmark:
Contributor:

What's the scope of the Benchmark? IIUC it's a benchmark for Dataset transformations; if so, maybe make that clearer. It would also be good to mention whether it's applicable for both local and distributed benchmarking.

@c21 (author):

Ideally the scope of Benchmark should cover all data-related benchmarks (dataset, dataset pipeline, transform, action, etc.); there's no restriction to use it only for dataset transformations. It works for both local and distributed benchmarking. Let me add more documentation.

@c21 (author):

@jianoaix - updated.

Contributor:

If fn is a Dataset-to-Dataset mapping, it's basically a transform? Things like iter_batches(), min/max, etc. are not covered.

@c21 (author):

Oh, it's just to make it easy to retrieve the statistics by returning another Dataset. You can run arbitrary logic inside the benchmarked function:

def fn(input_ds):
    # Arbitrary work to benchmark.
    input_ds.iter_batches(...)
    input_ds.min()
    input_ds.max()
    # Return whichever Dataset's stats should be reported.
    return the_ds_you_care_for_stats

@c21 (author):

Also just to add - the parameters to fn can be anything, so we are not bound to passing a Dataset.

Contributor:

I'd add a comment about what fn is expected to return.

@c21 (author):

@jianoaix - sure, added. Also run(fn: Callable[..., Dataset]) has function's expected return type.

("s3://air-example-data/h2oai_benchmark/G1_1e7_1e2_0_0.csv", "h2oai-500M")
]
for path, test_name in test_input:
    input_ds = ray.data.read_csv(path).repartition(10).fully_executed()
Contributor:

Seems like a magic number; can we document how it's chosen? If we're doing a local-node benchmark, can it just be set to the number of CPUs on the node, or does it need manual tuning?

@c21 (author):

Yeah, it should be set to the number of CPUs on the node to get the best performance. Let me add a comment.

@c21 (author):

@jianoaix - updated.

Contributor:

To make this benchmark runnable in different cluster setups (currently it's on one node, per the yaml config), it'd be better to read the number of CPUs from Ray, e.g. ray.cluster_resources().get("CPU", 1), rather than hard-coding it.

@c21 (author):

@jianoaix - thanks, didn't know the API before, updated.

working_dir: nightly_tests/dataset

frequency: multi
team: core
Contributor:

Now "data" team is owning the tests.

@c21 (author):

@jianoaix - good catch, updated.

Signed-off-by: Cheng Su <[email protected]>
@c21 (author) commented Sep 21, 2022:

Addressed all comments; the PR is ready for review again. Thanks.

@clarkzinzow (Contributor) left a comment:

LGTM!

Comment on lines +27 to +31
input_ds = ray.data.read_csv(path)
# Number of blocks (parallelism) should be set as number of available CPUs
# to get best performance.
num_blocks = int(ray.cluster_resources().get("CPU", 1))
input_ds = input_ds.repartition(num_blocks).fully_executed()
Contributor:

I'm assuming that we do an explicit repartition step instead of setting parallelism=num_blocks at read time since we're not guaranteed that parallelism will be respected, e.g. if parallelism > num_files?

@c21 (author):

Yes, there's only 1 file in this case, so we have to repartition (see the sketch below).
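
For illustration, the two options being discussed (a sketch only; read-time parallelism is a hint that can be capped by the number of input files, which is why the PR repartitions explicitly):

# Hint only: with a single input file this may produce fewer than num_blocks blocks.
input_ds = ray.data.read_csv(path, parallelism=num_blocks)

# An explicit repartition after the read guarantees the requested block count.
input_ds = ray.data.read_csv(path).repartition(num_blocks).fully_executed()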

merge=merge,
accumulate_block=accumulate_block,
name=(f"top2({str(on)})"),
)
Contributor:

It may be a good idea to look at porting this to a custom Polars aggregation once that integration is merged.

@c21 (author):

Yeah, also plan to enable Polars later.
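
For context, the diff above is the tail of a custom top2 aggregation built with ray.data.aggregate.AggregateFn. A rough, self-contained sketch of that pattern is below; it uses a per-row accumulator instead of the block-based one in the PR, and the init/accumulate_row/merge callables, column names, and usage example are illustrative assumptions rather than the PR's actual code:

import heapq

import pandas as pd
import ray
from ray.data.aggregate import AggregateFn


def top2(on):
    # Accumulator: a list holding the two largest values of column `on` seen so far.
    return AggregateFn(
        init=lambda key: [],
        accumulate_row=lambda acc, row: heapq.nlargest(2, acc + [row[on]]),
        merge=lambda left, right: heapq.nlargest(2, left + right),
        name=f"top2({on})",
    )


# Hypothetical h2oai q8-style usage: largest two v3 values per id6 group.
df = pd.DataFrame({"id6": [i % 3 for i in range(30)], "v3": [float(i) for i in range(30)]})
ds = ray.data.from_pandas(df)
print(ds.groupby("id6").aggregate(top2("v3")).take())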

clarkzinzow pushed a commit that referenced this pull request Sep 21, 2022
This PR prunes (removes) unused columns before doing the aggregate (in _GroupbyOp.map()). It only keeps the group-by column and the columns used in aggregate functions. All other columns can be pruned, which reduces the cost during sort and aggregate.

Also introduces BlockAccessor.select(keys) to get a new Block with only the selected keys/columns. Refactored the existing code path in map_groups to also use the API. Later on, we can use this API to implement Dataset.select_columns.

Tested with a query in the h2oai benchmark - #28486. Reduced query runtime by 50% with this PR.
@ericl added the @author-action-required label (the PR author is responsible for the next step; remove the tag to send back to the reviewer) on Sep 21, 2022
@c21 removed the @author-action-required label on Sep 21, 2022