[Datasets] Add initial aggregate benchmark #28486
Conversation
print(f"Running case: {name}")
start_time = time.perf_counter()
output_ds = fn(**fn_run_args)
We may want to run the benchmark multiple times to reduce noise. It's easy to add a for-loop around this later. Right now the aggregate benchmark doesn't have enough noise to be worth rerunning.
Sounds good. This requires fn to be stateless/side-effect-free.
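To illustrate the rerun idea from this thread, here is a minimal sketch of a repeated-run harness. The names (run_case, num_runs) are illustrative, not the actual benchmark code, and repeating the run like this is only valid when fn is stateless, as noted above:

```python
import statistics
import time


def run_case(name, fn, num_runs=3, **fn_run_args):
    """Run one benchmark case several times and return the median duration.

    Illustrative sketch only: repeating the timing loop assumes `fn` is
    stateless/side-effect free, so every run measures the same work.
    """
    durations = []
    for _ in range(num_runs):
        start_time = time.perf_counter()
        fn(**fn_run_args)
        durations.append(time.perf_counter() - start_time)
    median = statistics.median(durations)
    print(f"Running case: {name}, median duration: {median:.4f}s")
    return median
```

Taking the median rather than the mean makes the result less sensitive to a single slow outlier run.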
The input files are pre-generated and stored in AWS S3 beforehand.
"""
test_input = [
    ("s3://air-example-data/h2oai_benchmark/G1_1e7_1e2_0_0.csv", "h2oai-500M")
Is the input really CSV instead of parquet? That seems like it will spend a lot of time decoding just the CSV.
@ericl - yes, the input is a CSV file - see the script to generate the input file, and the Spark script to run the benchmark.

> That seems like it will spend a lot of time decoding just the CSV.
That's true, it will be significantly slower than Parquet. But the input is only loaded once and reused across benchmark runs, and the read time is not counted in the benchmark runtime, the same way h2oai db-benchmark measures other systems (e.g. the Spark script above). Right now, reading 500MB takes less than 10 seconds, and 5GB takes less than 1 minute.
Thanks for taking a stab at benchmarking!
from ray.data.dataset import Dataset

class Benchmark:
What's the scope of the Benchmark? IIUC it's a benchmark for Dataset transformations; if so, maybe make that clearer. It would also be good to mention whether it's applicable to both local and distributed benchmarking.
Ideally the scope of Benchmark should cover all data-related benchmarks (dataset, dataset pipeline, transform, action, etc.); there's no restriction to dataset transformations only. It works for both local and distributed benchmarking. Let me add more documentation.
@jianoaix - updated.
If fn is a Dataset-to-Dataset mapping, isn't it basically a transform? Things like iter_batches(), min/max, etc. are not covered.
Oh, it just makes it easy to retrieve the statistics by returning another Dataset. You can do arbitrary logic inside the benchmark:
def fn(input_ds):
    input_ds.iter_batches(...)
    input_ds.min()
    input_ds.max()
    return the_ds_you_care_for_stats
Also, just to add - the parameters to fn can be anything, so we are not bound to passing a Dataset.
I'd add a comment about what fn is expected to return.
@jianoaix - sure, added. Also, run(fn: Callable[..., Dataset]) has the function's expected return type in its signature.
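For illustration, the harness pattern described in this thread (time a user-supplied fn and record its duration under a case name) can be sketched as follows. This class shape is a simplified stand-in, not the actual Ray implementation:

```python
import time
from typing import Any, Callable


class Benchmark:
    """Simplified sketch of a benchmark harness: `run` times an arbitrary
    callable and records the elapsed wall-clock time per case. In the real
    PR, `fn` is expected to return a Dataset so its statistics can be read
    afterwards; here the return type is left open."""

    def __init__(self, name: str):
        self.name = name
        self.results: dict = {}

    def run(self, case_name: str, fn: Callable[..., Any], **fn_run_args) -> Any:
        start_time = time.perf_counter()
        output = fn(**fn_run_args)
        self.results[case_name] = time.perf_counter() - start_time
        return output
```

Because fn takes arbitrary keyword arguments, callers are not bound to passing a Dataset, matching the point made above.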
    ("s3://air-example-data/h2oai_benchmark/G1_1e7_1e2_0_0.csv", "h2oai-500M")
]
for path, test_name in test_input:
    input_ds = ray.data.read_csv(path).repartition(10).fully_executed()
This seems like a magic number; can we document how it's chosen? If we're doing a local-node benchmark, can it just be set to the number of CPUs on the node, or does it need manual tuning?
Yeah, it should be set to the number of CPUs on the node to get the best performance. Let me add a comment.
@jianoaix - updated.
To make this benchmark runnable in different cluster setups (currently it's on one node, per the yaml config), it'd be better to read the number of CPUs from Ray, e.g. ray.cluster_resources().get("CPU", 1), rather than hard-coding it.
@jianoaix - thanks, I didn't know about this API before; updated.
release/release_tests.yaml
working_dir: nightly_tests/dataset

frequency: multi
team: core
The "data" team now owns these tests.
@jianoaix - good catch, updated.
Signed-off-by: Cheng Su <[email protected]>
Addressed all comments; the PR is ready for review again. Thanks.
LGTM!
input_ds = ray.data.read_csv(path)
# Number of blocks (parallelism) should be set as number of available CPUs
# to get best performance.
num_blocks = int(ray.cluster_resources().get("CPU", 1))
input_ds = input_ds.repartition(num_blocks).fully_executed()
I'm assuming that we do an explicit repartition step instead of setting parallelism=num_blocks at read time since we're not guaranteed that parallelism will be respected, e.g. if parallelism > num_files?
Yes, there's only 1 file in this case, so we have to do the repartition.
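The constraint being discussed can be captured in a one-line model: a file-based read yields at most one block per file, so requested parallelism above the file count is not respected. This is a simplified model for illustration, not Ray's exact splitting logic:

```python
def effective_read_parallelism(requested_parallelism: int, num_files: int) -> int:
    """Simplified model: a file-based read produces at most one block per
    file, so effective parallelism is capped by the file count. This is why
    the benchmark repartitions explicitly after reading a single file."""
    return min(requested_parallelism, num_files)
```

With one input file, requesting 16-way parallelism at read time still yields a single block, hence the explicit repartition to num_blocks afterwards.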
    merge=merge,
    accumulate_block=accumulate_block,
    name=(f"top2({str(on)})"),
)
It may be a good idea to look at porting this to a custom Polars aggregation once that integration is merged.
Yeah, we also plan to enable Polars later.
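For context, the merge/accumulate_block pair that a top-2 aggregation needs can be sketched in plain Python. This is illustrative only; the PR wires equivalent functions into Ray Data's aggregation API with the signatures that API expects:

```python
import heapq


def accumulate_block(acc, block_values):
    """Fold a block of values into the accumulator, keeping the 2 largest."""
    return heapq.nlargest(2, list(acc) + list(block_values))


def merge(acc1, acc2):
    """Merge two partial accumulators, again keeping only the 2 largest."""
    return heapq.nlargest(2, list(acc1) + list(acc2))
```

The key property is that both functions are associative and keep the accumulator bounded at two elements, so partial results from different blocks can be combined in any order.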
This PR prunes (removes) unused columns before doing the aggregate (in _GroupbyOp.map()). It keeps only the group-by column and the columns used in aggregate functions; all other columns can be pruned, which reduces the cost during sort and aggregate. It also introduces BlockAccessor.select(keys) to get a new Block with only the selected keys/columns, and refactors the existing code path in map_groups to use the same API. Later on, we can use this API to implement Dataset.select_columns. Tested with a query in the h2oai benchmark - #28486. This PR reduced query runtime by 50%.
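The pruning step described above can be sketched on a plain column dictionary, used here as a simplified stand-in for a Ray Data block; the function and parameter names are illustrative, not the PR's actual BlockAccessor API:

```python
def prune_for_aggregate(columns: dict, group_key: str, agg_columns: list) -> dict:
    """Keep only the group-by key and the columns referenced by aggregate
    functions; everything else is dropped before the sort/aggregate step,
    which is where the cost reduction described above comes from."""
    keep = [group_key] + [c for c in agg_columns if c != group_key]
    return {name: columns[name] for name in keep}
```

Dropping unused columns before the shuffle means less data is sorted and moved between tasks, which is why the pruning cuts the aggregate query runtime.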
Signed-off-by: Cheng Su <[email protected]>
Why are these changes needed?
This PR adds an initial aggregate benchmark (for the h2oai benchmark - https://github.com/h2oai/db-benchmark). To follow the convention in the h2oai benchmark, the benchmark runs on a single node (https://h2oai.github.io/db-benchmark/#environment-configuration). There is no fundamental blocker to running the benchmark on multiple nodes (it's just a matter of changing our yaml file). The benchmark has three input file settings - 0.5GB, 5GB, and 50GB. Here we start with the 0.5GB input file; a follow-up PR will add benchmarks for 5GB and 50GB (just a matter of generating the input files, no benchmark code change needed).

NOTE: The benchmark queries are not optimized yet; this is just the most straightforward version of the code. We can use it as a baseline to identify performance gaps and optimize.
A typical benchmark workflow would be:
1. Add an xxx_benchmark.py file for the specific APIs to benchmark (e.g. split_benchmark.py for split-related APIs).
2. Use the Benchmark class to run the benchmark.

Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.

Tested on a workspace with the same cluster environment as aggregate_benchmark_compute.yaml.
Verified the benchmark succeeded - https://console.anyscale-staging.com/o/anyscale-internal/workspaces/expwrk_SkskBJ2Um8GMzaDAL4Zn8nvb/ses_LCGGwWMb1a14Zp5hqq6mTXvQ