[Datasets] Multi-aggregations [2/3]: Add groupby multi-column/multi-lambda aggregation #20074

clarkzinzow · 2021-11-04T19:27:43Z

As a stacked follow-up to #20044 (only the last commit contains incremental changes relative to the original PR), this PR adds multi-column/multi-lambda aggregations, making it much easier to express "do some aggregation on multiple columns".

Mean on 3 columns - old API

from ray.data.aggregate import Mean

ds.aggregate(Mean("A"), Mean("B"), Mean("C"))

Mean on 3 columns - new API

ds.mean(["A", "B", "C"])
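For readers without a Ray cluster at hand, the equivalence of the two call shapes can be sketched in plain Python. Everything below is a hypothetical stand-in: this `Mean` class, `aggregate`, and `mean` are not Ray's implementations, rows are modelled as plain dicts, and the output key naming is an assumption made for the sketch.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Dict, List, Sequence, Union

Row = Dict[str, float]


@dataclass(frozen=True)
class Mean:
    """Hypothetical stand-in for an aggregation over a single column."""
    on: str

    def compute(self, rows: Sequence[Row]) -> float:
        return fmean(row[self.on] for row in rows)


def aggregate(rows: Sequence[Row], *aggs: Mean) -> Dict[str, float]:
    """Old call shape: one aggregation object per column."""
    return {f"mean({agg.on})": agg.compute(rows) for agg in aggs}


def mean(rows: Sequence[Row], on: Union[str, List[str]]) -> Dict[str, float]:
    """New call shape: a list of columns expands to one Mean per column."""
    columns = [on] if isinstance(on, str) else list(on)
    return aggregate(rows, *(Mean(col) for col in columns))


rows = [
    {"A": 1.0, "B": 10.0, "C": 100.0},
    {"A": 3.0, "B": 30.0, "C": 300.0},
]

old_style = aggregate(rows, Mean("A"), Mean("B"), Mean("C"))
new_style = mean(rows, ["A", "B", "C"])
```

In this sketch the multi-column form is only a convenience layer: it builds the same per-column aggregations that the old form required callers to spell out.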

This PR also adds thorough checking of the "on" argument passed to aggregations.
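The PR text does not spell out which checks were added. As an illustration only, a validator for an `on`-style argument might look like the following; `normalize_on` and every rule it enforces are assumptions made for this sketch, not the checks Ray performs.

```python
from typing import Callable, List, Sequence, Union

OnArg = Union[None, str, Callable, List[Union[str, Callable]]]


def normalize_on(on: OnArg, columns: Sequence[str]) -> list:
    """Hypothetical helper: validate `on` and return it as a list.

    Assumed rules: None means "no column selected" and yields [None];
    strings must name an existing column; callables are accepted as-is;
    lists must be non-empty and contain only strings or callables.
    """
    if on is None:
        return [None]
    items = on if isinstance(on, list) else [on]
    if not items:
        raise ValueError("`on` must not be an empty list")
    for item in items:
        if isinstance(item, str):
            if item not in columns:
                raise ValueError(f"unknown column: {item!r}")
        elif not callable(item):
            raise TypeError(
                f"`on` entries must be str or callable, got {type(item).__name__}"
            )
    return list(items)
```

Failing early with a specific message is the point of such a check: a bad column name is reported at call time rather than surfacing as an obscure error deep inside the aggregation.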

Drivebys

.repartition() wasn't properly handling the num_partitions > num_rows case: for a simple Dataset with simple blocks, the generated empty blocks were Arrow blocks instead of simple blocks, which caused most downstream operations to break. We now explicitly handle the empty block case in .repartition(). We should find a way to fix this in general, so I opened an issue for that effort.
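The failure mode is easy to picture in miniature. The sketch below is not Ray's repartition implementation; `split_into_blocks` is a hypothetical helper that models simple blocks as Python lists and shows the property the driveby fix is after: when there are more partitions than rows, the surplus blocks are empty but keep the same block type as the non-empty ones.

```python
from typing import List, TypeVar

T = TypeVar("T")


def split_into_blocks(rows: List[T], num_partitions: int) -> List[List[T]]:
    """Hypothetical helper: split rows into num_partitions contiguous blocks."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # The first `extra` blocks receive one additional row each.
    base, extra = divmod(len(rows), num_partitions)
    blocks: List[List[T]] = []
    start = 0
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)
        # Slicing a list always yields a list, even when the slice is empty,
        # so empty blocks have the same type as populated ones.
        blocks.append(rows[start:start + size])
        start += size
    return blocks


blocks = split_into_blocks([1, 2, 3], 5)
```

Here three rows spread over five partitions leave the last two blocks empty, and downstream code can treat every block uniformly because none of them changes type.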

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

python/ray/data/dataset.py

python/ray/data/impl/block_list.py

python/ray/data/read_api.py

clarkzinzow · 2021-11-05T05:02:24Z

python/ray/data/tests/test_dataset.py

-    for i in range(4, 8):
-        assert values[i]["value"] == arr2[i - 4]
+    np.testing.assert_array_equal(
+        values, np.expand_dims(np.concatenate((arr1, arr2)), axis=1))


A happy accident of adding a __len__() definition to ArrowRow is that consuming rows from a NumPy Dataset now transparently removes the single-column table nesting.

python/ray/data/dataset.py

python/ray/data/grouped_dataset.py

ericl · 2021-11-05T17:49:36Z

python/ray/data/tests/test_dataset.py

+    print(f"Seeding RNG for test_groupby_arrow_multicolumn with: {seed}")
+    random.seed(seed)
+    xs = list(range(100))
+    random.shuffle(xs)


For bonus randomness I'd also .repartition(random(0, 100)) to induce random partitioning.

Otherwise every partition would have 1 element only.

I opted to go with a [1, 10, 100] partitioning set; I think this should give us the best tradeoff between partitioning edge-case coverage and test time.
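The testing idea in this thread (seed and log the RNG, shuffle the input, then run the same assertion under several partition counts) can be sketched without Ray. The helpers `partition` and `grouped_sum` are hypothetical, the choice of three groups is arbitrary, and a real test suite would more likely use pytest parametrization than a plain loop.

```python
import random
import time
from collections import defaultdict
from typing import Dict, List


def partition(xs: List[int], num_parts: int) -> List[List[int]]:
    """Hypothetical helper: deal xs round-robin into num_parts blocks."""
    return [xs[i::num_parts] for i in range(num_parts)]


def grouped_sum(blocks: List[List[int]], num_groups: int) -> Dict[int, int]:
    """Sum values per key (x % num_groups) by merging per-block partial sums."""
    totals: Dict[int, int] = defaultdict(int)
    for block in blocks:
        partial: Dict[int, int] = defaultdict(int)
        for x in block:
            partial[x % num_groups] += x
        for key, value in partial.items():
            totals[key] += value
    return dict(totals)


# Log the seed so that a failing run can be reproduced.
seed = int(time.time())
print(f"Seeding RNG with: {seed}")
random.seed(seed)

xs = list(range(100))
random.shuffle(xs)

# Reference answer computed directly, independent of order and partitioning.
expected = {k: sum(x for x in range(100) if x % 3 == k) for k in range(3)}

# Same aggregation under 1, 10, and 100 partitions.
results = {n: grouped_sum(partition(xs, n), 3) for n in (1, 10, 100)}
```

With 100 rows, the three partition counts cover one large block, several medium blocks, and one row per block, which are the edge cases the thread is concerned with.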

python/ray/data/tests/test_dataset.py

ericl

I think the empty block handling shouldn't be done in repartition.

python/ray/data/dataset.py

python/ray/data/grouped_dataset.py

python/ray/data/dataset.py

clarkzinzow · 2021-11-12T15:39:46Z

@ericl Could you take another pass?

python/ray/data/aggregate.py

clarkzinzow · 2021-11-12T21:58:31Z

Datasets tests passed and the other Python failures are unrelated. I think this is ready to merge. cc @ericl

clarkzinzow requested review from ericl and scv119 as code owners November 4, 2021 19:27

clarkzinzow assigned ericl, scv119 and jjyao Nov 4, 2021

clarkzinzow changed the title ~~[Datasets] Add groupby multi-column/lambda aggregation~~ [Datasets] Add groupby multi-column/multi-lambda aggregation Nov 4, 2021

clarkzinzow mentioned this pull request Nov 4, 2021

[Datasets] Multi-aggregations [1/3]: Add basic support for groupby multi-aggregations. #20044

Merged

6 tasks

clarkzinzow force-pushed the datasets/feat/groupby-multicolumn-aggregation branch from 20ac5d8 to a392f06 Compare November 4, 2021 19:44

ericl reviewed Nov 4, 2021

python/ray/data/dataset.py (outdated, resolved)

python/ray/data/impl/block_list.py (resolved)

python/ray/data/read_api.py (resolved)

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 4, 2021

clarkzinzow mentioned this pull request Nov 4, 2021

[Datasets] Multi-aggregations [3/3]: Add Pandas-like multi-aggregation API. #20090

Closed

6 tasks

clarkzinzow changed the title ~~[Datasets] Add groupby multi-column/multi-lambda aggregation~~ [Datasets] Multi-aggregations [2/3]: Add groupby multi-column/multi-lambda aggregation Nov 5, 2021

clarkzinzow force-pushed the datasets/feat/groupby-multicolumn-aggregation branch from a392f06 to db5129a Compare November 5, 2021 00:54

clarkzinzow commented Nov 5, 2021

clarkzinzow added 4 commits November 5, 2021 13:00

Add groupby multi-column/lambda aggregation + thorough checking of on…

fa83473

… arg.

New line in docstring.

333d6b1

Fix empty/cleared dataset check.

870bd03

Fix from_numpy test.

fac8259

clarkzinzow force-pushed the datasets/feat/groupby-multicolumn-aggregation branch from 8cee895 to fac8259 Compare November 5, 2021 13:01

clarkzinzow removed the @author-action-required label Nov 5, 2021

jjyao reviewed Nov 5, 2021

python/ray/data/dataset.py (outdated, resolved)

python/ray/data/grouped_dataset.py (outdated, resolved)

ericl reviewed Nov 5, 2021

python/ray/data/tests/test_dataset.py (resolved)

ericl added the @author-action-required label Nov 5, 2021

clarkzinzow added 6 commits November 5, 2021 19:53

Remove stale to_arrow test.

0792035

Create is_arrow_dataset utility; fix .schema() null schema check.

7520857

Consolidate most shared logic for building the set of underlying aggr…

2386e8b

…egations for a multi-column aggregation into a utility, shared by both Dataset and GroupedDataset.

Truncate the number of partitions to the number of rows when repartit…

99438ab

…ioning.

Parametrize the number of partitions/blocks in groupby tests.

5c1f99e

Almost-comparison for std.

d942bbf

jjyao added the @author-action-required label Nov 8, 2021

clarkzinzow added 2 commits November 9, 2021 16:30

Revert "Truncate the number of partitions to the number of rows when …

a776e26

…repartitioning." This reverts commit 99438ab.

Properly handle empty blocks when repartitioning.

1246bf5

clarkzinzow requested a review from jjyao November 9, 2021 22:18

clarkzinzow removed the tests-ok (The tagger certifies test failures are unrelated and assumes personal liability.) and @author-action-required labels Nov 9, 2021

ericl requested changes Nov 9, 2021

python/ray/data/dataset.py (resolved)

python/ray/data/dataset.py (resolved)

python/ray/data/grouped_dataset.py (resolved)

ericl added the @author-action-required label Nov 9, 2021

jjyao approved these changes Nov 9, 2021

python/ray/data/dataset.py (resolved)

clarkzinzow force-pushed the datasets/feat/groupby-multicolumn-aggregation branch 7 times, most recently from 79a068f to 8399bd1 Compare November 11, 2021 00:12

Improved docstrings.

9c467c1

clarkzinzow force-pushed the datasets/feat/groupby-multicolumn-aggregation branch from 8399bd1 to 9c467c1 Compare November 12, 2021 15:36

clarkzinzow added the tests-ok label and removed the @author-action-required label Nov 12, 2021

jjyao reviewed Nov 12, 2021

python/ray/data/aggregate.py (outdated, resolved)

ericl added the @author-action-required label and removed the tests-ok label Nov 12, 2021

Fix AggregateOnT type.

944d309

clarkzinzow added the tests-ok label and removed the @author-action-required label Nov 12, 2021

ericl approved these changes Nov 12, 2021

ericl merged commit 918a215 into ray-project:master Nov 12, 2021
