fix: consolidate regular indexing #1943

agoose77 · 2022-12-03T14:52:55Z

Fixes #1358 by using maybe_to_NumpyArray. RegularArrays that succeed with maybe_to_NumpyArray() follow NumPy indexing. Previously, they followed Awkward Indexing.

📚 The documentation for this PR will be available at https://awkward-array.readthedocs.io/en/agoose77-fix-regular-indexing/ once Read the Docs has finished building 🔨

agoose77 · 2022-12-03T16:53:31Z

@jpivarski this PR touches the indexing logic, and needs a careful eye.

It makes two big changes:

1. Fields are now only selected if a list of strings is passed, rather than any iterable. This prevents arrays of strings from selecting fields. It also means that an empty list behaves like an empty array, which was already the case.
2. RegularArrays that succeed with maybe_to_NumpyArray() follow NumPy indexing. Previously, they followed Awkward indexing.

(2) is the tricky point. First, a short history recap:

1. v1 prevented mixed regular-ragged layouts, which led to #434 ("argmin/max with keepdims=True returns regulararray"), so keepdims=True was changed to return the same type as the layout.
2. v2 removed this restriction.
3. #1811 ("feat: add RegularArray._reduce_next implementation") made keepdims=True always return a length-1 dimension (i.e. it reverted the fix for #434).

This PR addresses #1358, which exposes the lack of symmetry between NumpyArray and RegularArray for indexing.
The fix for this issue is simple - make RegularArray agree with NumpyArray. However, this means users need to be careful. Consider the following:

>>> a = ak.Array([[0, 1, 2], [3, 4], [5]])
>>> a[ak.argmin(a, axis=1, keepdims=True)]
<Array [[0], [3], [5]] type='3 * var * ?int64'>
>>> a[ak.argmin(a, axis=1, keepdims=True, mask_identity=False)]
<Array [[[0, 1, 2]], [[0, ...]], [[0, 1, 2]]] type='3 * 1 * var * int64'>

The mask from mask_identity=True (case 1, the implicit default) means that Awkward indexing is followed. In case 2 the mask is omitted and an identity is used instead. As a result, the index array can trivially be converted to a NumPy array, and we use NumPy indexing.
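To make the difference between the two rules concrete, here is a minimal pure-Python sketch on nested lists. The helpers `awkward_style` and `numpy_style` are hypothetical names for illustration only; they are not part of Awkward Array and only mimic the results shown in the session above.

```python
# Hypothetical helpers that imitate the two indexing rules on nested lists.
# They are illustrations only, not Awkward Array internals.

def awkward_style(data, index):
    """Row-wise pick: index[i] lists the positions to take from data[i]."""
    return [[None if j is None else row[j] for j in picks]
            for row, picks in zip(data, index)]


def numpy_style(data, index):
    """Advanced indexing on the first axis: every integer in `index`
    selects a whole row of `data`, so the result gains a dimension."""
    return [[data[j] for j in picks] for picks in index]


a = [[0, 1, 2], [3, 4], [5]]
argmin_keepdims = [[0], [0], [0]]  # position of the minimum in each row

print(awkward_style(a, argmin_keepdims))  # [[0], [3], [5]]
print(numpy_style(a, argmin_keepdims))    # [[[0, 1, 2]], [[0, 1, 2]], [[0, 1, 2]]]
```

The index values are identical in both calls; only the rule applied to them differs.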

I think the question here is not "is this correct?" because if

1. we want to support multiple indexing types, and
2. we decide which one to use according to the layout type

then the observed behaviour is consistent with this. The catch, from a UX perspective, is that the difference between ?int64 and int64 alone determines which one you get.
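The selection policy described here can be sketched as a small dispatch on the index type. This is a rough illustration under stated assumptions: `IndexType` and `choose_indexing` are hypothetical names, and the index types in the comments (3 * 1 * ?int64 and 3 * 1 * int64) are assumed from the discussion rather than taken from real output.

```python
from dataclasses import dataclass


@dataclass
class IndexType:
    """Hypothetical summary of an index array's type (not an Awkward class)."""
    regular: bool  # every dimension has a fixed length, e.g. 3 * 1 * int64
    option: bool   # values may be missing, e.g. ?int64


def choose_indexing(index_type):
    # Assumption: only a regular, non-option index converts cleanly to a
    # NumPy array, and only then is NumPy indexing used.
    if index_type.regular and not index_type.option:
        return "numpy"
    return "awkward"


# Assumed type with the default mask: 3 * 1 * ?int64
print(choose_indexing(IndexType(regular=True, option=True)))   # awkward
# Assumed type with mask_identity=False: 3 * 1 * int64
print(choose_indexing(IndexType(regular=True, option=False)))  # numpy
```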

Are you comfortable with this policy? (And indeed, anyone else on the team!)

codecov · 2022-12-03T18:40:49Z

Codecov Report

Merging #1943 (5d38b4b) into main (3682de6) will increase coverage by 0.00%.
The diff coverage is 92.85%.

Additional details and impacted files

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/awkward/contents/listoffsetarray.py | `79.53% <ø> (ø)` | |
| src/awkward/contents/unionarray.py | `85.71% <ø> (ø)` | |
| src/awkward/_slicing.py | `85.77% <90.00%> (-0.07%)` | ⬇️ |
| src/awkward/contents/content.py | `75.46% <100.00%> (+0.06%)` | ⬆️ |
| src/awkward/contents/indexedoptionarray.py | `88.92% <100.00%> (ø)` | |
| src/awkward/contents/listarray.py | `90.21% <100.00%> (ø)` | |
| src/awkward/contents/unmaskedarray.py | `66.23% <100.00%> (ø)` | |
| src/awkward/contents/regulararray.py | `88.10% <0.00%> (+0.19%)` | ⬆️ |

jpivarski

Actually, I intended for any Iterable of strings to count as a fields selection, including Awkward Arrays of strings. Empty Iterables are only ambiguous (field-selection or row-selection by integers?) if untyped, and Awkward Arrays are one way of providing a runtime type. NumPy arrays are another.

I'd prefer to keep that feature. I guess there weren't any tests preventing you from changing it, but I did check that it worked while developing it.

On the other example with the ?int64 vs int64 toggling Awkward and NumPy slicing, I can see why that happens and I agree that it's confusing. We might at some point have to deprecate that behavior (warning on NumPy-style slicing and then phasing it out, forcing users to explicitly wrap as NumPy at some point...?), but not now: it's too close to release time and that would be a major, major change.

So let's leave the confusing but "correct" argmax behavior. I'd like to switch back to letting any Iterable of strings select fields, though, if it's not too much trouble.
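As a rough sketch of the rule under discussion (any iterable of strings selects fields, while an untyped empty iterable falls back to row selection), consider the following. The function `selects_fields` is a hypothetical illustration, not the code in src/awkward/_slicing.py, and it cannot model typed empty arrays, which the comment above notes would carry a runtime type.

```python
from collections.abc import Iterable


def selects_fields(item):
    """Return True if `item` should be read as a field selection (sketch)."""
    if isinstance(item, str):
        return True
    if isinstance(item, Iterable):
        values = list(item)
        # An untyped empty iterable is ambiguous; this sketch treats it as
        # a row selection, matching "an empty list behaves like an empty array".
        return len(values) > 0 and all(isinstance(v, str) for v in values)
    return False


print(selects_fields(["x", "y"]))  # True
print(selects_fields(("x",)))      # True
print(selects_fields([]))          # False
print(selects_fields([0, 1]))      # False
```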

agoose77 · 2022-12-03T20:30:16Z

> On the other example with the ?int64 vs int64 toggling Awkward and NumPy slicing, I can see why that happens and I agree that it's confusing. We might at some point have to deprecate that behavior (warning on NumPy-style slicing and then phasing it out, forcing users to explicitly wrap as NumPy at some point...?), but not now: it's too close to release time and that would be a major, major change.

As a precursor statement, I would not advocate changing slicing dramatically at this point. So, agreed! Also, I don't have a proposal here - this seems to me to be a fundamental constraint with our indexing; we support many indexing features, and they are not mutually exclusive, so we have to choose according to some scheme.

My long-running feeling has generally been that it's better to have useful type information (e.g. "I reduced this axis, and so I have a length-1 dimension") over pandering to the shortfalls of our indexing mechanism (e.g. "I reduced a var dimension, so it stays var"). And, to be clear, I don't think any of this is "wrong" or "right", it just has its pros and cons.

> We might at some point have to deprecate that behavior (warning on NumPy-style slicing and then phasing it out, forcing users to explicitly wrap as NumPy at some point...?), but not now: it's too close to release time and that would be a major, major change.

Actually, this seems like the "best" solution to me in the long-long term. Introducing a new accessor like .at would make a lot of code less readable long-term, and I don't think most people are likely to want NumPy indexing. Awkward's indexing is more powerful for the kinds of data we work with. We can discuss this after the release; I'll open an issue.

I raised this just to make sure I'm not doing anything daft — this is very fundamental code I'm changing (fixing), and I wanted to make sure that we're all on the same page.

agoose77 · 2022-12-03T20:36:35Z

I've reverted b4456fc and added new tests that enshrine this behavior :)

jpivarski

Thanks for reverting the Awkward Array of strings behavior.

I approve the intention of this PR, and I have a question about only one line of code. When you've answered it for yourself, you can merge.

I can't check the code deeply, but that one line of code was the only one that looks suspicious to me.

src/awkward/_slicing.py

agoose77 added 8 commits December 3, 2022 12:33

test: add test case

f0c7498

refactor: normalise_item_RegularArray_to_ListOffsetArray64 should only be idempotent for 1D arrays

a90dad2

refactor: add notes about indexing

31d7571

chore: remove surplus getitem case

819d0a3

test: correct test case!

8b4bacb

refactor: directly use _to_numpy

9313c37

fix: normalise purelist_regular RegularArrays to NumPy arrays

92f4cd5

docs: add comment about types

b9baaf0

agoose77 added 4 commits December 3, 2022 17:19

fix: better support mask handling

91ce4c8

fix: require that fields are lists of strings

b4456fc

chore: rename test

d9bdf81

test: restore 434 test

25d1bbf

agoose77 requested a review from jpivarski December 3, 2022 19:12

agoose77 added the pr-next-release Required for the next release label Dec 3, 2022

jpivarski requested changes Dec 3, 2022

agoose77 added 2 commits December 3, 2022 20:31

Revert "fix: require that fields are lists of strings"

b243188

This reverts commit b4456fc.

test: ensure that we test all indexing cases

e513ad3

agoose77 requested a review from jpivarski December 3, 2022 20:36

jpivarski approved these changes Dec 3, 2022

fix: only return the data of NumpyArray objects

5d38b4b

agoose77 merged commit 65ffb92 into main Dec 3, 2022

agoose77 deleted the agoose77/fix-regular-indexing branch December 3, 2022 22:28

jpivarski removed the pr-next-release Required for the next release label Feb 15, 2023

agoose77 mentioned this pull request Jun 26, 2023

Reducers should preserve parameters #2544

Open

agoose77 mentioned this pull request Aug 3, 2023

Reducers (sum/any/all) convert axis to regular axis with keepdims=True #2609

Closed
