Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Multi-aggregations [2/3]: Add groupby multi-column/multi-lambda aggregation #20074

Conversation

clarkzinzow
Copy link
Contributor

@clarkzinzow clarkzinzow commented Nov 4, 2021

As a stacked follow-up to #20044 (only the last commit contains incremental changes from original PR), this PR adds multi-column/multi-lambda aggregations, making it much easier to express "do some aggregation on multiple columns".

Mean on 3 columns - old API

from ray.data.aggregate import Mean

ds.aggregate(Mean("A"), Mean("B"), Mean("C"))

Mean on 3 columns - new API

ds.mean(["A", "B", "C"])

This PR also adds thorough checking of the on aggregation argument.

Drivebys

  1. .repartition() wasn't properly handling the num_partitions > num_rows case: for a simple Dataset with simple blocks, the generated empty blocks were Arrow blocks instead of simple blocks, which causes most downstream operations to break. We explicitly handle the empty block case in .repartition(). We should find a way to fix this in general, so I opened an issue for that effort.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@clarkzinzow clarkzinzow changed the title [Datasets] Add groupby multi-column/lambda aggregation [Datasets] Add groupby multi-column/multi-lambda aggregation Nov 4, 2021
@clarkzinzow clarkzinzow force-pushed the datasets/feat/groupby-multicolumn-aggregation branch from 20ac5d8 to a392f06 Compare November 4, 2021 19:44
python/ray/data/dataset.py Outdated Show resolved Hide resolved
python/ray/data/impl/block_list.py Show resolved Hide resolved
python/ray/data/read_api.py Show resolved Hide resolved
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 4, 2021
@clarkzinzow clarkzinzow changed the title [Datasets] Add groupby multi-column/multi-lambda aggregation [Datasets] Multi-aggregations [2/3]: Add groupby multi-column/multi-lambda aggregation Nov 5, 2021
@clarkzinzow clarkzinzow force-pushed the datasets/feat/groupby-multicolumn-aggregation branch from a392f06 to db5129a Compare November 5, 2021 00:54
for i in range(4, 8):
assert values[i]["value"] == arr2[i - 4]
np.testing.assert_array_equal(
values, np.expand_dims(np.concatenate((arr1, arr2)), axis=1))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A happy accident of adding a __len__() definition to ArrowRow is that consuming rows from a NumPy Dataset now transparently removes the single-column table nesting.

@clarkzinzow clarkzinzow force-pushed the datasets/feat/groupby-multicolumn-aggregation branch from 8cee895 to fac8259 Compare November 5, 2021 13:01
@clarkzinzow clarkzinzow removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 5, 2021
python/ray/data/dataset.py Outdated Show resolved Hide resolved
python/ray/data/grouped_dataset.py Outdated Show resolved Hide resolved
print(f"Seeding RNG for test_groupby_arrow_multicolumn with: {seed}")
random.seed(seed)
xs = list(range(100))
random.shuffle(xs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For bonus randomness I'd also .repartition(random(0, 100)) to induce random partitioning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise every partition would have 1 element only.

Copy link
Contributor Author

@clarkzinzow clarkzinzow Nov 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opted to go with a [1, 10, 100] partitioning set, I think this should give us the best tradeoff of partitioning edge case coverage with test time.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 5, 2021
@jjyao jjyao added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 8, 2021
@clarkzinzow clarkzinzow removed tests-ok The tagger certifies test failures are unrelated and assumes personal liability. @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Nov 9, 2021
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the empty block handling shouldn't be done in repartition.

python/ray/data/dataset.py Show resolved Hide resolved
python/ray/data/dataset.py Show resolved Hide resolved
python/ray/data/grouped_dataset.py Show resolved Hide resolved
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 9, 2021
python/ray/data/dataset.py Show resolved Hide resolved
@clarkzinzow clarkzinzow force-pushed the datasets/feat/groupby-multicolumn-aggregation branch 7 times, most recently from 79a068f to 8399bd1 Compare November 11, 2021 00:12
@clarkzinzow clarkzinzow force-pushed the datasets/feat/groupby-multicolumn-aggregation branch from 8399bd1 to 9c467c1 Compare November 12, 2021 15:36
@clarkzinzow
Copy link
Contributor Author

@ericl Could you take another pass?

@clarkzinzow clarkzinzow added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Nov 12, 2021
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
@ericl ericl added @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. and removed tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels Nov 12, 2021
@clarkzinzow clarkzinzow added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Nov 12, 2021
@clarkzinzow
Copy link
Contributor Author

Datasets tests passed and other Python failures are unrelated, I think this is ready to merge cc @ericl

@ericl ericl merged commit 918a215 into ray-project:master Nov 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants