[Datasets] Add vectorized global and grouped aggregations. #23478
Conversation
lg! I'll take a closer look tomorrow.
looking good!
    if self.num_rows() == 0:
        return None
why None instead of 0?
We need to be able to distinguish a non-empty table that is all nulls (0) from an empty table (None) when ignore_nulls=False, since we want to propagate that None in the latter case. See the Mean and Std agg function definitions.
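For illustration only, here is a minimal sketch of the convention under discussion; the helper name block_sum and its exact null semantics are hypothetical, not the PR's code:

```python
from typing import List, Optional


def block_sum(values: List[Optional[float]], ignore_nulls: bool) -> Optional[float]:
    # Hypothetical sketch of the return convention being discussed: an empty
    # block returns None so that Mean/Std can tell "no rows at all" apart from
    # a non-empty block whose values are all null or sum to 0.
    if len(values) == 0:
        return None
    if not ignore_nulls and any(v is None for v in values):
        # Nulls poison the aggregate when they are not being ignored.
        return None
    return sum(v for v in values if v is not None)
```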
    # column will result in an all-None column of object type, which will raise
    # a type error when attempting to do most binary operations. We explicitly
    # check for this type failure here so we can properly propagate a null.
    if np.issubdtype(col.dtype, np.object_) and col.isnull().all():
Could we just check beforehand?
I'm purposefully not checking beforehand since this check is expensive: col.isnull().all() creates a full copy of the column and then traverses it (2 scans). By only checking on such a TypeError, we keep this expensive check off the common critical path.
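A minimal sketch of that pattern (the helper add_columns is hypothetical, not the PR's code), where the expensive check only runs in the failure path:

```python
import numpy as np
import pandas as pd


def add_columns(col: pd.Series, other: pd.Series) -> pd.Series:
    # Try the vectorized binary op first; only inspect the column when it
    # fails, keeping the O(n) all-null check off the common critical path.
    try:
        return col + other
    except TypeError:
        # An all-None column materializes as object dtype, and most binary
        # ops on it raise TypeError; treat that case as an all-null result.
        if np.issubdtype(col.dtype, np.object_) and col.isnull().all():
            return pd.Series([None] * len(col), index=col.index)
        raise
```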
@jjyao I addressed your feedback, ready for another pass.
… to base Block arg.
… block accumulation.
@jjyao Ping on this!
Nice!
Can we have a representative perf test to measure the improvement? (Can be a follow-up after this PR.)
This PR adds support for vectorized global and grouped aggregations, porting the built-in aggregations to vectorized block aggregations for tabular datasets.
The AggregateFn abstraction is extended with an optional vectorized_aggregate function that performs a vectorized aggregation on a single block, allowing aggregations to opt in to vectorized block aggregation. AggregateFn also exposes a can_vectorize_for_block() API, which allows aggregations to opt in to vectorized block aggregation for only certain block types, e.g. only for Arrow and Pandas blocks. The built-in set of aggregations currently only opts in to vectorized block aggregation on tabular datasets, i.e. only for Arrow and Pandas blocks, since vectorized aggregation of simple blocks would amount to the accumulator loop with extra copying for each group slice (no zero-copy views are possible for Python lists).

For Arrow blocks, vectorized block aggregation is supported by creating zero-copy views of each group slice within each partition and applying the vectorized aggregation to these group slice views. This currently entails two scans of each block partition: one to determine the group view boundaries, and one to process each group. As a future optimization, we could eliminate this extra scan by gathering group boundaries while sorting and partitioning each block along the sample boundaries.
Checks
I've run scripts/format.sh to lint the changes in this PR.