Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: consolidate regular indexing #1943

Merged
merged 15 commits into from
Dec 3, 2022
Merged

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Dec 3, 2022

Fixes #1358 by using maybe_to_NumpyArray. RegularArrays that succeed with maybe_to_NumpyArray() follow NumPy indexing. Previously, they followed Awkward Indexing.


📚 The documentation for this PR will be available at https://awkward-array.readthedocs.io/en/agoose77-fix-regular-indexing/ once Read the Docs has finished building 🔨

@agoose77
Copy link
Collaborator Author

agoose77 commented Dec 3, 2022

@jpivarski this PR touches the indexing logic, and needs a careful eye.

It makes two big changes:

  1. fields are now only selected if a list of strings is passed, rather than any iterable. This prevents arrays of strings selecting fields. This also means that an empty list behaves like an empty array, which was already the case.
  2. RegularArrays that succeed with maybe_to_NumpyArray() follow NumPy indexing. Previously, they followed Awkward Indexing.

(2) is the tricky point. First, a short history recap:

This PR addresses #1358, which exposes the lack of symmetry between NumpyArray and RegularArray for indexing.
The fix for this issue is simple - make RegularArray agree with NumpyArray. However, this means users need to be careful. Consider the following:

>>> a = ak.Array([[0, 1, 2], [3, 4], [5]])
>>> a[ak.argmin(a, axis=1, keepdims=True)]
<Array [[0], [3], [5]] type='3 * var * ?int64'>
>>> a[ak.argmin(a, axis=1, keepdims=True, mask_identity=False)]
<Array [[[0, 1, 2]], [[0, ...]], [[0, 1, 2]]] type='3 * 1 * var * int64'>

The mask from mask_identity=False (case 1, implicit) means that the Awkward indexing is followed. Whereas in case 2 the mask is omitted and an identity used instead. Due to this, the array can trivially be converted to a NumPy array, and we use NumPy indexing.

I think the question here is not "is this correct?" because if

  • we want to support multiple indexing types, and
  • we decide upon which one to use according to the layout type

then the observed behaviour is consistent with this. It's only that, from a UX perspective, it's only the difference between ?int64 and int64 that means you get one or the other.

Are you comfortable with this policy? (And indeed, anyone else on the team!)

@codecov
Copy link

codecov bot commented Dec 3, 2022

Codecov Report

Merging #1943 (5d38b4b) into main (3682de6) will increase coverage by 0.00%.
The diff coverage is 92.85%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/contents/listoffsetarray.py 79.53% <ø> (ø)
src/awkward/contents/unionarray.py 85.71% <ø> (ø)
src/awkward/_slicing.py 85.77% <90.00%> (-0.07%) ⬇️
src/awkward/contents/content.py 75.46% <100.00%> (+0.06%) ⬆️
src/awkward/contents/indexedoptionarray.py 88.92% <100.00%> (ø)
src/awkward/contents/listarray.py 90.21% <100.00%> (ø)
src/awkward/contents/unmaskedarray.py 66.23% <100.00%> (ø)
src/awkward/contents/regulararray.py 88.10% <0.00%> (+0.19%) ⬆️

@agoose77 agoose77 added the pr-next-release Required for the next release label Dec 3, 2022
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I intended for any Iterable of strings to count as a fields selection, including Awkward Arrays of strings. Empty Iterables are only ambiguous (field-selection or row-selection by integers?) if untyped, and Awkward Arrays are one way of providing a runtime type. NumPy arrays are another.

I'd prefer to keep that feature. I guess there weren't any tests preventing you from changing it, but I did check that it worked while developing it.

On the other example with the ?int64 vs int64 toggling Awkward and NumPy slicing, I can see why that happens and I agree that it's confusing. We might say some point have to deprecate that behavior (warning on NumPy-style slicing and then phase it out—forcing users to explicitly wrap as NumPy at some point...?), but not now—it's too close to release time and that would be a major, major change.

So let's leave the confusing but "correct" argmax behavior. I'd like to switch back to letting any Iterable of strings select fields, though, if it's not too much trouble.

@agoose77
Copy link
Collaborator Author

agoose77 commented Dec 3, 2022

On the other example with the ?int64 vs int64 toggling Awkward and NumPy slicing, I can see why that happens and I agree that it's confusing. We might say some point have to deprecate that behavior (warning on NumPy-style slicing and then phase it out—forcing users to explicitly wrap as NumPy at some point...?), but not now—it's too close to release time and that would be a major, major change.

As a precursor statement, I would not advocate changing slicing dramatically at this point. So, agreed! Also, I don't have a proposal here - this seems to me to be a fundamental constraint with our indexing; we support many indexing features, and they are not mutually exclusive, so we have to choose according to some scheme.

My long-running feeling has generally been that it's better to have useful type information (e.g. "I reduced this axis, and so I have a length-1 dimension") over pandering to the shortfalls of our indexing mechanism (e.g. "I reduced a var dimension, so it stays var"). And, to be clear, I don't think any of this is "wrong" or "right", it just has its pros and cons.

We might say some point have to deprecate that behavior (warning on NumPy-style slicing and then phase it out—forcing users to explicitly wrap as NumPy at some point...?), but not now—it's too close to release time and that would be a major, major change.

Actually, this seems like the "best" solution to me in the long-long term. Introducing a new accessor like .at would make a lot of code less readable long-term, and I don't think most people are likely to want NumPy indexing. Awkward's indexing is more powerful for the kinds of data we work with. We can discuss this after the release; I'll open an issue.

I raised this just to make sure I'm not doing anything daft ­— this is very fundamental code I'm changing (fixing), and I wanted to make sure that we're all on the same page.

@agoose77
Copy link
Collaborator Author

agoose77 commented Dec 3, 2022

I've reverted b4456fc and added new tests that enshrine this behavior :)

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reverting the Awkward Array of strings behavior.

I approve the intention of this PR, and I have a question about only one line of code. When you've answered it for yourself, you can merge.

I can't check the code deeply, but that one line of code was the only one that looks suspicious to me.

src/awkward/_slicing.py Outdated Show resolved Hide resolved
@agoose77 agoose77 merged commit 65ffb92 into main Dec 3, 2022
@agoose77 agoose77 deleted the agoose77/fix-regular-indexing branch December 3, 2022 22:28
@jpivarski jpivarski removed the pr-next-release Required for the next release label Feb 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regular indexing behaves like jagged index for RegularArray (v2)
2 participants