-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Datasets] Support ignoring NaNs in aggregations. #20787
[Datasets] Support ignoring NaNs in aggregations. #20787
Conversation
a4c480a
to
f88f1cd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops forgot to submit the comments.
Ping |
|
96f0df3
to
23d17b1
Compare
23d17b1
to
d20dca4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
e878fcb
to
b18f002
Compare
Datasets tests pass so this is ready to be merged, unless @ericl wants to take a pass. |
python/ray/data/aggregate.py
Outdated
a = init(k) | ||
if not isinstance(a, list): | ||
a = [a] | ||
return a + [0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a type alias for the return type here? Like NullableValue = Tuple[T, bool]
and use that throughout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There doesn't appear to be a straightforward such type to use, since it's actually closer to MaybeAgg = AggType + Tuple[int]
(if AggType
were a tuple), which I don't think that we can easily represent as a type. 🤔
Tests appear to be ok, the docs build failure looks to be transient, not sure how to trigger a rebuild there. |
…y-project#20787)" (ray-project#22258)" This reverts commit d295a9d.
Reverts #22258, unreverting #20787. The fix is in the ["Fix tests" commit](b559da2), where we switch to using the test utility DataFrame equality comparison which properly handles NaN comparisons. The underling cause of this test break is explained [here](#22258 (comment)).
Adds support for ignoring NaNs in aggregations. NaNs will now be ignored by default, and the user can pass in `ds.mean("A", ignore_nulls=False)` if they would rather have the NaN be propagated to the output. Specifically, we'd have the following null-handling semantics: 1. Mix of values and nulls - `ignore_nulls`=True: Ignore the nulls, return aggregation of values 2. Mix of values and nulls - `ignore_nulls`=False: Return `None` 3. All nulls: Return `None` 4. Empty dataset: Return `None` This all null and empty dataset handling matches the semantics of NumPy and Pandas.
…t#20787)" (ray-project#22258) This reverts commit f264cf8.
Reverts ray-project#22258, unreverting ray-project#20787. The fix is in the ["Fix tests" commit](ray-project@b559da2), where we switch to using the test utility DataFrame equality comparison which properly handles NaN comparisons. The underling cause of this test break is explained [here](ray-project#22258 (comment)).
Adds support for ignoring NaNs in aggregations. NaNs will now be ignored by default, and the user can pass in
ds.mean("A", ignore_nulls=False)
if they would rather have the NaN be propagated to the output. Specifically, we'd have the following null-handling semantics:ignore_nulls
=True: Ignore the nulls, return aggregation of valuesignore_nulls
=False: ReturnNone
None
None
This all null and empty dataset handling matches the semantics of NumPy and Pandas.
TODOs:
Checks
scripts/format.sh
to lint the changes in this PR.