[Datasets] Support ignoring NaNs in aggregations. #20787

Merged

Conversation

clarkzinzow

@clarkzinzow clarkzinzow commented Nov 30, 2021

Adds support for ignoring NaNs in aggregations. NaNs are now ignored by default; the user can pass ignore_nulls=False, e.g. ds.mean("A", ignore_nulls=False), if they would rather have the NaN propagated to the output. Specifically, we have the following null-handling semantics:

  1. Mix of values and nulls - ignore_nulls=True: Ignore the nulls, return aggregation of values
  2. Mix of values and nulls - ignore_nulls=False: Return None
  3. All nulls: Return None
  4. Empty dataset: Return None

This all-null and empty-dataset handling matches the semantics of NumPy and Pandas.
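The four cases above can be sketched in plain Python. This is only an illustration of the semantics as listed, not the code in python/ray/data/aggregate.py; the names `mean_with_nulls` and `_is_null` are hypothetical, and treating both `None` and float NaN as "null" is an assumption.

```python
import math
from typing import Iterable, Optional


def _is_null(x) -> bool:
    # Assumption: both None and float NaN count as nulls.
    return x is None or (isinstance(x, float) and math.isnan(x))


def mean_with_nulls(
    values: Iterable[Optional[float]], ignore_nulls: bool = True
) -> Optional[float]:
    """Hypothetical mean that follows the four cases listed above."""
    total = 0.0
    count = 0
    for v in values:
        if _is_null(v):
            if not ignore_nulls:
                # Case 2: a null is present and must not be ignored.
                return None
            # Case 1: skip the null and keep aggregating.
            continue
        total += v
        count += 1
    if count == 0:
        # Cases 3 and 4: all nulls, or an empty input.
        return None
    return total / count


print(mean_with_nulls([1.0, float("nan"), 3.0]))                      # case 1
print(mean_with_nulls([1.0, float("nan"), 3.0], ignore_nulls=False))  # case 2
print(mean_with_nulls([float("nan"), None]))                          # case 3
print(mean_with_nulls([]))                                            # case 4
```

Returning `None` in case 2 follows the numbered list; the sketch does not try to model any other propagation behavior.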

TODOs:

  • Add test coverage for the rest of the aggregations.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl ericl self-assigned this Nov 30, 2021
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2021
@clarkzinzow clarkzinzow force-pushed the datasets/feat/aggregations-ignore-nans branch from a4c480a to f88f1cd Compare November 30, 2021 23:44
@clarkzinzow clarkzinzow removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2021
@clarkzinzow clarkzinzow changed the title [Datasets] [BLOCKED] Support ignoring NaNs in aggregations. [Datasets] Support ignoring NaNs in aggregations. Nov 30, 2021
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
Contributor

@ericl ericl left a comment

Oops forgot to submit the comments.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 1, 2021
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/grouped_dataset.py Show resolved Hide resolved
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/tests/test_dataset.py Show resolved Hide resolved
@ericl
Contributor

ericl commented Dec 18, 2021

Ping

@bveeramani
Member

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

  1. Install Black:
       pip install -I black==21.12b0
  2. Format changed files with Black:
       curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
       chmod +x ./format-changed.sh
       ./format-changed.sh
       rm format-changed.sh
  3. Commit your changes:
       git add --all
       git commit -m "Format Python code with Black"
  4. Merge master into your branch:
       git pull upstream master
  5. Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

@clarkzinzow clarkzinzow force-pushed the datasets/feat/aggregations-ignore-nans branch 3 times, most recently from 96f0df3 to 23d17b1 Compare February 2, 2022 02:27
@clarkzinzow
Contributor Author

@ericl @jjyao This is ready for another pass!

@clarkzinzow clarkzinzow force-pushed the datasets/feat/aggregations-ignore-nans branch from 23d17b1 to d20dca4 Compare February 2, 2022 20:47
Collaborator

@jjyao jjyao left a comment

lgtm

python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/dataset.py Outdated Show resolved Hide resolved
@clarkzinzow clarkzinzow force-pushed the datasets/feat/aggregations-ignore-nans branch from e878fcb to b18f002 Compare February 7, 2022 19:46
@clarkzinzow clarkzinzow removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 7, 2022
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
@clarkzinzow
Contributor Author

Datasets tests pass so this is ready to be merged, unless @ericl wants to take a pass.

python/ray/data/aggregate.py Outdated Show resolved Hide resolved
    a = init(k)
    if not isinstance(a, list):
        a = [a]
    return a + [0]
Contributor

Could we add a type alias for the return type here? Like NullableValue = Tuple[T, bool] and use that throughout?

Contributor Author

There doesn't appear to be a straightforward type for this, since it's actually closer to MaybeAgg = AggType + Tuple[int] (if AggType were a tuple), which I don't think we can easily represent as a type. 🤔
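To make the typing difficulty concrete, here is a minimal sketch built around the four-line snippet under review. The wrapper name `wrap_init` and the two example `init` functions are hypothetical, and what the trailing `0` stands for is not stated in this thread, so the sketch only shows how the shape of the result varies.

```python
from typing import Any, Callable, List


def wrap_init(init: Callable[[Any], Any]) -> Callable[[Any], List[Any]]:
    """Hypothetical wrapper around the snippet under review."""

    def _init(k: Any) -> List[Any]:
        a = init(k)
        if not isinstance(a, list):
            a = [a]
        # Append one extra int slot to whatever the accumulator already holds.
        return a + [0]

    return _init


# A scalar accumulator becomes a two-element list...
scalar_init = wrap_init(lambda k: 0)
print(scalar_init("A"))

# ...while a two-element accumulator becomes a three-element list.
pair_init = wrap_init(lambda k: [0, 0])
print(pair_init("A"))
```

Because the wrapped accumulator is "the original fields plus one int", its length depends on the wrapped aggregation, which is why a fixed alias such as Tuple[T, bool] does not describe it.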

python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 8, 2022
@clarkzinzow clarkzinzow removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 9, 2022
@clarkzinzow
Contributor Author

Tests appear to be OK; the docs build failure looks transient, but I'm not sure how to trigger a rebuild there.

@clarkzinzow clarkzinzow added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Feb 9, 2022
@ericl ericl merged commit f264cf8 into ray-project:master Feb 9, 2022
jjyao added a commit to jjyao/ray that referenced this pull request Feb 9, 2022
scv119 pushed a commit that referenced this pull request Feb 10, 2022
clarkzinzow added a commit to clarkzinzow/ray that referenced this pull request Feb 10, 2022
ericl pushed a commit that referenced this pull request Feb 11, 2022
Reverts #22258, unreverting #20787. 

The fix is in the ["Fix tests" commit](b559da2), where we switch to using the test utility DataFrame equality comparison, which properly handles NaN comparisons. The underlying cause of this test break is explained [here](#22258 (comment)).
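As a standard-library illustration of why NaN needs special handling in test comparisons: NaN never compares equal to itself, so a naive equality check fails even when expected and actual values match. The helper `rows_equal` below is hypothetical and is not the Ray test utility referred to above.

```python
import math


def rows_equal(a, b) -> bool:
    """Hypothetical comparison that treats NaN as equal to NaN."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        both_nan = (
            isinstance(x, float)
            and isinstance(y, float)
            and math.isnan(x)
            and math.isnan(y)
        )
        if not (both_nan or x == y):
            return False
    return True


expected = [1.0, float("nan")]
actual = [1.0, float("nan")]

# Plain equality on the NaN entries is False under IEEE 754 rules.
print(expected[1] == actual[1])

# The NaN-aware comparison treats the two rows as matching.
print(rows_equal(expected, actual))
```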
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Feb 27, 2022
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Feb 27, 2022
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Feb 27, 2022