feat: add ak.drop_none() #1904
This is a good implementation and I can see that the tests are exhaustive, using the full suite of layouts.
I think it's good and ready to merge, as soon as the build-docs issue is fixed. If it's fixed in #1905 first, we'll merge that into main and then merge main into here. (One way or the other.)
Nice work @ioanaif!! I needed this feature so much during my analysis work.
I'm on holiday, but I know that this is a big feature and wanted to try and offer an additional set of eyes.
I'm done touching this PR now (hands off) - I fixed the whitespace in my suggestion that didn't merge properly :)
I looked at the tests too hastily: this is not wrapping its result as a high-level array, which it should by default.
There's also an error (below) that I noticed when trying the RecordArray example, which should be possible. I don't know why RecordArray reveals it but NumpyArray doesn't. I'm looking more closely at this now...
Okay, I'm sorry, but I missed some things. The to_list tests in test_1904-drop-none.py are insensitive to the difference between high-level and low-level arrays, so that's why we didn't catch that issue.
I suspect that having _drop_none return different types for different nodes (Content vs (Index, Content)) is probably the indirect cause, though I don't know what mechanism is responsible for it. Anyway, it will be safer to have all the _drop_none methods return the same type; the top-level drop_none (no underscore) is in a good position to drop the unnecessary outindex at the end.
Also, this should at least pass through records: when users want to remove missing values, they'll want to remove them from within records just as much as from recordless arrays. A good guide for this could be is_none, which also looks inside of arrays; is_none and drop_none should be pretty similar.
>>> ak.is_none(ak.Array([[{"x": [1]}], [{"x": [None]}]]), axis=2).show()
[[{x: [False]}],
[{x: [True]}]]
>>> ak.is_none(ak.Array([[{"x": [1], "y": [[2]]}], [{"x": [None], "y": [[None]]}]]), axis=-1).show()
[[{x: False, y: [False]}],
[{x: False, y: [False]}]]
In the above, x and y have lists of different depths, but axis=-1 counts up from the bottom. That's why most functions keep calling wrap_axis_if_negative until it finally becomes positive (after the recursion has passed through the record and it's seeing a single, unambiguous depth).
I think these are the only two issues, though. There are tests in test_1904-drop-none.py that involve records; I'll take a look at what happened there now.
This looks like it's almost completely done. I see that you're using the new maybe_posaxis.
There are some cases that the tests didn't (couldn't) cover because the tests are based on slicing with is_none, which isn't possible if a record introduces different depths. I've worked through some examples that could be used as direct tests.
Starting from this array, which has missing values at all levels, with different levels for the x and y branches of the record:
array = ak.Array(
[None,
[None, {"x": [1], "y": [[2]]}],
[{"x": [3], "y": [None]}, {"x": [None], "y": [[None]]}]
])
>>> array.show()
[None,
[None, {x: [1], y: [[2]]}],
[{x: [3], y: [None]}, {x: [None], y: [[None]]}]]
First, with axis=0, the results from is_none and drop_none are correct because only the None outside of any lists is at axis=0.
>>> ak.is_none(array, axis=0).show()
[True,
False,
False]
>>> ak.drop_none(array, axis=0).show()
[[None, {x: [1], y: [[2]]}],
[{x: [3], y: [None]}, {x: [None], y: [[None]]}]]
Next, with axis=1, is_none and drop_none are still both correct.
>>> ak.is_none(array, axis=1).show()
[None,
[True, False],
[False, False]]
>>> ak.drop_none(array, axis=1).show()
[None,
[{x: [1], y: [[2]]}],
[{x: [3], y: [None]}, {x: [None], y: [[None]]}]]
Next, with axis=2, is_none is correct because it is only making x: [True] or y: [True] for a None at this level; anything deeper is hidden inside a False and anything shallower is visible.
However, drop_none is producing the right values but putting them in the wrong places: the x: [3] should be in the previous record, so drop_none is incorrect in this example.
>>> ak.is_none(array, axis=2).show()
[None,
[None, {x: [False], y: [False]}],
[{x: [False], y: [True]}, {x: [True], y: [False]}]]
>>> ak.drop_none(array, axis=2).show()
[None,
[None, {x: [1], y: [[2]]}],
[{x: [], y: []}, {x: [3], y: [[None]]}]]
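For a direct test, the expected axis=2 result can be cross-checked against a plain-Python reference. This is only a sketch on Python lists and dicts (the helper name drop_none_model is made up here and is not part of Awkward); it passes through records without changing depth, so each x value stays in its own record:

```python
def drop_none_model(data, axis, depth=0):
    """Plain-Python reference: drop None among the elements found at `axis`.

    Only non-negative `axis` is handled. `depth` is the axis of the elements
    of the list currently being visited.
    """
    if isinstance(data, dict):
        # records add no dimension: visit every field at the same depth
        return {name: drop_none_model(value, axis, depth) for name, value in data.items()}
    if not isinstance(data, list):
        return data  # numbers and None above the requested axis are left alone
    if depth == axis:
        return [item for item in data if item is not None]
    return [drop_none_model(item, axis, depth + 1) for item in data]


array = [
    None,
    [None, {"x": [1], "y": [[2]]}],
    [{"x": [3], "y": [None]}, {"x": [None], "y": [[None]]}],
]

expected_axis2 = [
    None,
    [None, {"x": [1], "y": [[2]]}],
    [{"x": [3], "y": []}, {"x": [], "y": [[None]]}],
]

print(drop_none_model(array, 2) == expected_axis2)  # True
```

A unit test could then compare ak.drop_none(array, axis=2).to_list() against expected_axis2, assuming the fixed implementation agrees with this model.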
Next, with axis=-1, the deepest x values have one [, ] while the deepest y values have two [[, ]]. The is_none function reports x to be [False] or [True] with one bracket, and it reports y to be [[False]] or [[True]] with two brackets, so is_none is correct.
The drop_none function correctly leaves the y: [None] (one bracket) because it's above axis=-1, and it keeps x: [3] or removes it, x: [], also with one bracket because that's what axis=-1 means for x. However, the x: [3] should be on the previous record, so it's incorrect for this example.
>>> ak.is_none(array, axis=-1).show()
[None,
[None, {x: [False], y: [[False]]}],
[{x: [False], y: [None]}, {x: [True], y: [[True]]}]]
>>> ak.drop_none(array, axis=-1).show()
[None,
[None, {x: [1], y: [[2]]}],
[{x: [], y: [None]}, {x: [3], y: [[]]}]]
Finally, with axis=-2, is_none should say x: False and (potentially) x: True (no instances in this array) with zero brackets and y: [False] and y: [True] with one bracket. It does, and is_none is correct.
Thinking about what drop_none should do in this case, it can't remove an x field without also removing a y field, so axis=-2 should raise an exception for this kind of thing, and it doesn't:
>>> ak.is_none(array, axis=-2).show()
[None,
[None, {x: False, y: [False]}],
[{x: False, y: [True]}, {x: False, y: [False]}]]
>>> ak.drop_none(array, axis=-2).show()
[None,
[None, {x: [1], y: [[2]]}],
[{x: [3], y: []}, {x: [], y: [[None]]}]]
To be concrete: the original array has x: [1], x: [3], and x: [None] at this axis, all of which are not missing, but if one were, then it would want to remove that x value, yet axis=-2 for y does not mean for it to act at this level. You can implement this exception by adding code where the recursion gets to the RecordArray: if posaxis == depth - 1 for some of its fields but not others, then it should raise a np.AxisError.
(is_none raises an np.AxisError if axis=-3, which is impossible for what is_none wants to do.)
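A minimal sketch of the check-and-raise idea, in plain Python so that it runs without Awkward or NumPy. Everything here is an assumption for illustration: check_record_axis is a hypothetical helper, the local AxisError class is only a stand-in for NumPy's exception, and depth is taken to be the number of list dimensions above the record.

```python
class AxisError(ValueError, IndexError):
    """Stand-in for NumPy's AxisError so that the sketch needs no imports."""


def check_record_axis(axis, depth, levels_below):
    """Resolve a negative axis per field and refuse the mixed case.

    axis:         the negative axis requested by the user
    depth:        number of list dimensions above the record
    levels_below: {field name: list dimensions below the record in that field}
    """
    posaxis = {name: axis + depth + levels for name, levels in levels_below.items()}
    # fields for which the axis points at the field value itself
    at_record = {name for name, value in posaxis.items() if value == depth - 1}
    if at_record and len(at_record) != len(posaxis):
        raise AxisError(
            f"axis={axis} would drop whole values of {sorted(at_record)} "
            "but not of the other fields"
        )
    return posaxis


# x has one list dimension below the record, y has two
print(check_record_axis(-1, 2, {"x": 1, "y": 2}))  # {'x': 2, 'y': 3}

try:
    check_record_axis(-2, 2, {"x": 1, "y": 2})
except AxisError as err:
    print("raised:", err)
```

With axis=-2, x resolves to axis 1 (the level of the record itself) while y resolves to axis 2, which is exactly the mixed case that should be rejected before any RecordArray is constructed.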
I need a better array2 for this case, so that I'm not talking in hypotheticals.
array2 = ak.Array(
[None,
[None, {"x": [1], "y": [[2]]}],
[{"x": None, "y": [None]}, {"x": [None], "y": [[None]]}]
])
>>> array2.show()
[None,
[None, {x: [1], y: [[2]]}],
[{x: None, y: [None]}, {x: [None], y: [[None]]}]]
Now is_none at axis=-2 is
>>> ak.is_none(array2, axis=-2).show()
[None,
[None, {x: False, y: [False]}],
[{x: True, y: [True]}, {x: False, y: [False]}]]
which has x: False wherever x has a list and x: True wherever x has a direct None.
Trying to run drop_none on this raises a ValueError in the RecordArray constructor, but it should have been an np.AxisError before attempting to compute and construct the RecordArray.
These can be directly dropped in as new unit tests. This PR should also increase the awkward-cpp version number to 3 in awkward-cpp/pyproject.toml.
I think the above is just one error in indexing plus needing to add one check-and-raise-exception. It's an enormous amount of work to get this far.
Actually, I'm going to stick to the habit of always changing version numbers as direct commits to |
I pushed some docs fixes, but any code suggestions I've left here! Nice work @ioanaif :)
…ror, as they should.
I believe this is done! (Let's see the tests pass.)
Thanks for all of the hard work on this; it was definitely a lot more involved than I had thought it was going to be when I first suggested it. But now it works in all the extreme cases and we know it's not going to come up again as a bug in somebody's analysis.
It reminds me of something... found it:
This applies a lot more often than I'd like it to.
Yay! Hope all corner cases have been discovered! 🥳🥳
This PR adds the drop_none functionality. Requested in #832.
ak.drop_none(array, axis) - removes missing values (None) from a given array.
For example, in the following array,
a = ak.Array([[[0]], [[None]], [[1], None], [[2, None]]])
The None values will be removed, resulting in
The default axis is None. However, an axis can be specified:
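The semantics described here can be illustrated on plain Python lists. This is a hedged sketch, not the code in this PR: drop_none_lists is a made-up name, it ignores records and negative axes, and it only models what "remove None at every level" versus "remove None at one axis" means for the example array.

```python
def drop_none_lists(data, axis=None, depth=0):
    """Drop None from nested Python lists.

    axis=None removes None at every level; a non-negative axis removes only
    the None values among the elements found at that axis.
    """
    if not isinstance(data, list):
        return data
    if axis is None:
        return [drop_none_lists(item, None, depth + 1) for item in data if item is not None]
    if axis == depth:
        return [item for item in data if item is not None]
    return [drop_none_lists(item, axis, depth + 1) for item in data]


a = [[[0]], [[None]], [[1], None], [[2, None]]]

print(drop_none_lists(a))          # [[[0]], [[]], [[1]], [[2]]]
print(drop_none_lists(a, axis=1))  # [[[0]], [[None]], [[1]], [[2, None]]]
```

With axis=1 only the None that is a direct element of the third inner list is removed; the None values one level deeper are left in place.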