[Datasets] Support ignoring NaNs in aggregations. #20787

clarkzinzow · 2021-11-30T04:15:26Z

Adds support for ignoring NaNs in aggregations. NaNs will now be ignored by default, and the user can pass in ds.mean("A", ignore_nulls=False) if they would rather have the NaN be propagated to the output. Specifically, we'd have the following null-handling semantics:

1. Mix of values and nulls - ignore_nulls=True: Ignore the nulls, return aggregation of values
2. Mix of values and nulls - ignore_nulls=False: Return None
3. All nulls: Return None
4. Empty dataset: Return None

This all-null and empty-dataset handling matches the semantics of NumPy and Pandas.
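The four cases above can be sketched in plain Python. This is only an illustration of the described semantics, not the code in this PR: `is_null` and `mean_with_nulls` are hypothetical names, and treating both `None` and float NaN as null is an assumption of the sketch.

```python
import math
from typing import Iterable, Optional


def is_null(value) -> bool:
    """Treat None and float NaN as null (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mean_with_nulls(
    values: Iterable[Optional[float]], ignore_nulls: bool = True
) -> Optional[float]:
    """Hypothetical mean that follows the four null-handling cases above."""
    total = 0.0
    count = 0
    for v in values:
        if is_null(v):
            if not ignore_nulls:
                # Case 2: a null is propagated to the output.
                return None
            # Case 1: skip the null and keep aggregating.
            continue
        total += v
        count += 1
    # Cases 3 and 4: nothing left to aggregate, so return None.
    return total / count if count else None


print(mean_with_nulls([1.0, float("nan"), 3.0]))                      # 2.0
print(mean_with_nulls([1.0, float("nan"), 3.0], ignore_nulls=False))  # None
print(mean_with_nulls([float("nan"), None]))                          # None
print(mean_with_nulls([]))                                            # None
```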

TODOs:

Add test coverage for the rest of the aggregations.

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/

Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

python/ray/data/aggregate.py

ericl

Oops forgot to submit the comments.

python/ray/data/aggregate.py

python/ray/data/grouped_dataset.py

python/ray/data/aggregate.py

python/ray/data/tests/test_dataset.py

ericl · 2021-12-18T05:55:35Z

Ping

bveeramani · 2022-01-30T05:23:02Z

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

Install Black

pip install -I black==21.12b0

Format changed files with Black

curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
chmod +x ./format-changed.sh
./format-changed.sh
rm format-changed.sh

Commit your changes.

git add --all
git commit -m "Format Python code with Black"

Merge master into your branch.

git pull upstream master

Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

clarkzinzow · 2022-02-02T02:28:00Z

@ericl @jjyao This is ready for another pass!

python/ray/data/dataset.py

jjyao

lgtm

python/ray/data/aggregate.py

python/ray/data/dataset.py

python/ray/data/aggregate.py

clarkzinzow · 2022-02-08T02:36:34Z

Datasets tests pass so this is ready to be merged, unless @ericl wants to take a pass.

python/ray/data/aggregate.py

ericl · 2022-02-08T02:46:00Z

python/ray/data/aggregate.py

+    a = init(k)
+    if not isinstance(a, list):
+        a = [a]
+    return a + [0]


Could we add a type alias for the return type here? Like NullableValue = Tuple[T, bool] and use that throughout?

clarkzinzow · 2022-02-09

There doesn't appear to be a straightforward type to use here, since it's actually closer to MaybeAgg = AggType + Tuple[int] (if AggType were a tuple), which I don't think we can easily represent as a type. 🤔
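To make the typing difficulty concrete, here is a small sketch built around the four diff lines quoted above. `wrap_init` is a hypothetical wrapper name, and the purpose of the trailing `0` slot is not stated in the thread, so the sketch does not assume one. It only shows that the wrapped accumulator's length depends on the underlying aggregation, which is why a fixed alias such as `Tuple[T, bool]` does not describe it.

```python
from typing import Any, Callable, List


def wrap_init(init: Callable[[Any], Any]) -> Callable[[Any], List[Any]]:
    """Hypothetical wrapper reproducing the quoted diff lines."""

    def wrapped(k: Any) -> List[Any]:
        a = init(k)
        if not isinstance(a, list):
            # Scalar accumulators are promoted to a one-element list.
            a = [a]
        # One extra integer slot is appended after the original accumulator.
        return a + [0]

    return wrapped


# A scalar accumulator (e.g. a running sum) becomes a 2-element list.
scalar_init = wrap_init(lambda k: 0)
# A list accumulator (e.g. [sum, count]) becomes a 3-element list.
pair_init = wrap_init(lambda k: [0, 0])

print(scalar_init("A"))  # [0, 0]
print(pair_init("A"))    # [0, 0, 0]
```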

python/ray/data/aggregate.py

clarkzinzow · 2022-02-09T04:13:23Z

Tests appear to be OK; the docs build failure looks transient, though I'm not sure how to trigger a rebuild there.


Reverts #22258, unreverting #20787. The fix is in the ["Fix tests" commit](b559da2), where we switch to using the test utility DataFrame equality comparison which properly handles NaN comparisons. The underlying cause of this test break is explained [here](#22258 (comment)).
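For context on why a plain equality check can fail once NaNs appear in expected results: under IEEE 754, NaN never compares equal to itself, so a test needs a NaN-aware comparison. pandas' `pandas.testing.assert_frame_equal` treats NaNs in the same position as equal; whether that is the exact utility the fix uses is not stated here. The sketch below shows the idea with the standard library only, and `values_equal` and `rows_equal` are hypothetical helper names.

```python
import math
from typing import Any, Sequence


def values_equal(a: Any, b: Any) -> bool:
    """Equality that treats two float NaNs as equal."""
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True
    return a == b


def rows_equal(left: Sequence[Any], right: Sequence[Any]) -> bool:
    """Element-wise NaN-aware comparison of two equal-length sequences."""
    return len(left) == len(right) and all(
        values_equal(a, b) for a, b in zip(left, right)
    )


# Plain equality on two distinct NaN values is False.
print(float("nan") == float("nan"))                          # False
# The NaN-aware comparison treats matching NaN positions as equal.
print(rows_equal([1.0, float("nan")], [1.0, float("nan")]))  # True
print(rows_equal([1.0, float("nan")], [1.0, 2.0]))           # False
```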


clarkzinzow requested review from ericl and scv119 as code owners November 30, 2021 04:15

ericl self-assigned this Nov 30, 2021

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2021

clarkzinzow force-pushed the datasets/feat/aggregations-ignore-nans branch from a4c480a to f88f1cd Compare November 30, 2021 23:44

clarkzinzow assigned scv119 and jjyao Nov 30, 2021

clarkzinzow removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2021

clarkzinzow changed the title ~~[Datasets] [BLOCKED] Support ignoring NaNs in aggregations.~~ [Datasets] Support ignoring NaNs in aggregations. Nov 30, 2021

ericl reviewed Dec 1, 2021

python/ray/data/aggregate.py

python/ray/data/aggregate.py

ericl reviewed Dec 1, 2021

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 1, 2021

jjyao reviewed Dec 1, 2021

python/ray/data/aggregate.py

python/ray/data/grouped_dataset.py

jjyao reviewed Dec 1, 2021

python/ray/data/aggregate.py

python/ray/data/aggregate.py

python/ray/data/tests/test_dataset.py

clarkzinzow force-pushed the datasets/feat/aggregations-ignore-nans branch 3 times, most recently from 96f0df3 to 23d17b1 Compare February 2, 2022 02:27

clarkzinzow requested review from jjyao and ericl February 2, 2022 02:28

Support ignoring nans in aggregations.

d20dca4

clarkzinzow force-pushed the datasets/feat/aggregations-ignore-nans branch from 23d17b1 to d20dca4 Compare February 2, 2022 20:47

jjyao reviewed Feb 2, 2022

python/ray/data/dataset.py

clarkzinzow added 2 commits February 3, 2022 19:12

Fix ignore_nulls documentation.

f820570

Fix test_split

5cb79a9

jjyao reviewed Feb 3, 2022

python/ray/data/aggregate.py

python/ray/data/aggregate.py

python/ray/data/dataset.py

Remove empty sentinel.

b18f002

clarkzinzow force-pushed the datasets/feat/aggregations-ignore-nans branch from e878fcb to b18f002 Compare February 7, 2022 19:46

Fix Optional[] use, fix ignore_nulls docstrings.

974be59

clarkzinzow requested a review from jjyao February 7, 2022 19:55

clarkzinzow removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 7, 2022

Cache Pandas import.

78d99cf

jjyao approved these changes Feb 7, 2022

python/ray/data/aggregate.py

Fix finalizer return type.

3b4d30e

ericl reviewed Feb 8, 2022

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 8, 2022

PR feedback.

4d058d0

clarkzinzow requested review from jjyao and ericl February 9, 2022 04:11

clarkzinzow removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 9, 2022

clarkzinzow added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Feb 9, 2022

ericl approved these changes Feb 9, 2022

ericl merged commit f264cf8 into ray-project:master Feb 9, 2022

jjyao added a commit to jjyao/ray that referenced this pull request Feb 9, 2022

Revert "[Datasets] Support ignoring NaNs in aggregations. (ray-projec…

d79257a

…t#20787)" This reverts commit f264cf8.

scv119 pushed a commit that referenced this pull request Feb 10, 2022

Revert "[Datasets] Support ignoring NaNs in aggregations. (#20787)" (#…

d295a9d

…22258) This reverts commit f264cf8.

clarkzinzow added a commit to clarkzinzow/ray that referenced this pull request Feb 10, 2022

Revert "Revert "[Datasets] Support ignoring NaNs in aggregations. (ra…

7329b9a

…y-project#20787)" (ray-project#22258)" This reverts commit d295a9d.

clarkzinzow mentioned this pull request Feb 10, 2022

[Datasets] Unrevert NaN handling. #22291

Merged

6 tasks

simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Feb 27, 2022

Revert "[Datasets] Support ignoring NaNs in aggregations. (ray-projec…

eb6d29a

…t#20787)" (ray-project#22258) This reverts commit f264cf8.
