Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add RegularArray._reduce_next implementation #1811

Merged
merged 15 commits into from
Oct 20, 2022

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Oct 19, 2022

Our current reduction through a RegularArray is done by casting to and from a ListOffsetArray. As described in #1790, this has two consequences:

  • We lose regular dimensions quite easily
  • We lose information in the case of zero-length dimensions

The first point can be fixed (this code excerpt from RegularArray + the changes to ListOffsetArray in this PR):

Fix for RegularArray._reduce_next
        branch, depth = self.branch_depth

        if depth == negaxis:
            if keepdims:
                return ak.contents.RegularArray(
                    out.content.toRegularArray(),
                    1,
                    self.length,
                    None,
                    None,
                    self._nplike,
                )
            else:
                return out.toRegularArray()

        if keepdims and depth == negaxis + 1:
            outcontent = out.content
            assert isinstance(
                outcontent, (ak.contents.ListOffsetArray, ak.contents.RegularArray)
            )
            # Determined by self._content._reduce_next(..., len(self), ...)
            outcontent = ak.contents.RegularArray(
                outcontent.content,
                size=1,
                zeros_length=len(self),
            )
            out = ak.contents.ListOffsetArray(
                out.offsets,
                outcontent,
                out.identifier,
                out.parameters,
                out.nplike
            )
        elif depth > negaxis + 1:
            outcontent = out.content
            assert isinstance(
                outcontent, (ak.contents.ListOffsetArray, ak.contents.RegularArray)
            )
            # Determined by self._content._reduce_next(..., len(self), ...)
            outcontent = ak.contents.RegularArray(
                outcontent.content,
                size=self._size,
                zeros_length=len(self),
            )
            out = ak.contents.ListOffsetArray(
                out.offsets,
                outcontent,
                out.identifier,
                out.parameters,
                out.nplike
            )

This second point means that just calling toRegularArray() on the ListOffsetArray._reduce_next reduction result is not sufficient; in the process of reducing the regular child of a ragged array with empty lists, we lose information about these empty lists, which then need care in order to be reconstructed. In addition to this bug, the whole process of going to-from ListOffsetArray64 here lossy, and involves multiple kernels, which reduces performance.

The actual kernels required to implement RegularArray._reduce_next seem fairly trivial. So, I wrote this PR to implement them (only for reduction).

This kind of code is really hard to reason about, though, so any extra pairs of eyes on the assumptions that I've made here would be very helpful. I might have got this horribly wrong; it's easy to get the wrong mental model, I've found.

Specifically, this PR:

  • Changes ListOffsetArray to return a RegularArray instead of ListOffsetArray when keepdims=True. This seems wrong at first, but the return result of _reduce_next is actually the parent layout. It's the parent's responsibility to coerce this to the correct type. RegularArray is lower overhead than a ragged type.
  • Fixes cases where regular types were lost in reduction.
  • Assumes that parent ordering does not matter now that we have removed the findgaps kernel (we still assume that parents need to be locally contiguous, i.e. 1 1 3 3 2 2 vs 1 3 1 3 2 2). I think this is reasonable, as there are good reasons for requiring local contiguity, but fewer for global.
  • Fix the tests that assume we lose regularity.
  • Make keepdims always insert a length=1 axis to ensure broadcastability
  • Closes Unable to reduce regular dimension in mixed layout #1790

This PR does not:

  • Do the same for RegularArray._sort_next etc. These should be done at some point (or at least, the ragged/regular type preservation improved), but there's future work on unifying sort and argsort that will make this slightly easier.

📚 The documentation for this PR will be available at https://awkward-array.readthedocs.io/en/agoose77-fix-proper-regulararray/ once Read the Docs has finished building 🔨

@agoose77 agoose77 marked this pull request as ready for review October 19, 2022 15:38
@agoose77
Copy link
Collaborator Author

@jpivarski I haven't added any kernel tests yet to the kernel test data. Do you have any suggestions about how best to do this; do I just need to churn through some examples and add them to the data-file?

@agoose77 agoose77 added the pr-needs-backport This PR needs a counterpart to backport to older versions label Oct 19, 2022
@codecov
Copy link

codecov bot commented Oct 19, 2022

Codecov Report

Merging #1811 (7b0133e) into main (569f183) will increase coverage by 0.00%.
The diff coverage is 86.11%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/_slicing.py 86.04% <ø> (+0.33%) ⬆️
src/awkward/contents/unmaskedarray.py 71.67% <0.00%> (+0.91%) ⬆️
src/awkward/contents/indexedoptionarray.py 88.90% <72.72%> (-0.18%) ⬇️
src/awkward/contents/bytemaskedarray.py 88.04% <77.77%> (-0.29%) ⬇️
src/awkward/contents/regulararray.py 89.92% <90.47%> (+0.70%) ⬆️
src/awkward/contents/indexedarray.py 79.46% <100.00%> (-0.50%) ⬇️
src/awkward/contents/listoffsetarray.py 79.47% <100.00%> (-0.06%) ⬇️
src/awkward/nplikes.py 66.60% <0.00%> (-0.20%) ⬇️
src/awkward/_typetracer.py 74.71% <0.00%> (-0.19%) ⬇️
... and 6 more

@agoose77
Copy link
Collaborator Author

I'd missed a couple of places where our option-types eagerly coerce RegularArrays into ListOffsetArrays. This fixes the few tests that had var introduced through reduction

@agoose77 agoose77 force-pushed the agoose77/fix-proper-regulararray branch from c62b82b to f671788 Compare October 20, 2022 12:45
@agoose77 agoose77 closed this Oct 20, 2022
@agoose77 agoose77 force-pushed the agoose77/fix-proper-regulararray branch from f671788 to 2511d4b Compare October 20, 2022 12:48
@agoose77 agoose77 reopened this Oct 20, 2022
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(GitHub ate my first review.)

This looks very well thought-through, and the large, explanatory comments are very helpful. It's not a situation in which I think the comments will get out of date because these things are not changed often or rapidly. I don't see anything missing, like a kernel implementation without a specification.

I haven't added any kernel tests yet to the kernel test data. Do you have any suggestions about how best to do this; do I just need to churn through some examples and add them to the data-file?

The test data was mostly made automatically. The hard part was choosing test inputs; the outputs were determined by running the kernels. It's easier to hand-craft some test inputs soon after having written the kernel, so if you have an idea in mind about test inputs that would not trivially skip the code, a good mix of valid and invalid inputs, then you can add those, run the function, and just insert the observed outputs as expected outputs. (I.e. we're not pretending to predict the function's behavior, we're just pinning it in place so we'll notice if it changes or if the CUDA version doesn't agree.)

src/awkward/contents/indexedarray.py Outdated Show resolved Hide resolved
Comment on lines 1385 to 1388
# If the result of `_reduce_next` is a list, and we're not applying at this
# depth, then it will have offsets given by the boundaries in parents.
# This means that we need to look at the _contents_ to which the `outindex`
# belongs to add the option type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is for the code that was removed.

Suggested change
# If the result of `_reduce_next` is a list, and we're not applying at this
# depth, then it will have offsets given by the boundaries in parents.
# This means that we need to look at the _contents_ to which the `outindex`
# belongs to add the option type

src/awkward/contents/listoffsetarray.py Show resolved Hide resolved
src/awkward/contents/regulararray.py Outdated Show resolved Hide resolved
@agoose77
Copy link
Collaborator Author

agoose77 commented Oct 20, 2022

@jpivarski I've made the changes you requested, and importantly removed most of IndexedArray._reduce_next. Could you check that I've not made an oversight here? My reasoning is that IndexedArray is effectively a deferred carry, which is realised during reduction. The nextparents, starts, etc. all apply to the result of applying self._content.carry(self._index), so it should just work™.

@jpivarski
Copy link
Member

That's right, it is just a carry!

Very likely, this happened because the v1 C++ IndexedArray and IndexedOptionArray were a single class, and more work is needed for the IndexedOptionArray (the else clause of if (index[i] >= 0) in awkward_IndexedArray_reduce_next_64). I don't think I ever noticed that the IndexedArray case is simpler. Now that they're two Python classes, the ways in which IndexedArray can collapse down are more apparent.

Do you want to add kernel test samples? (Not all of the kernels have them.) Otherwise, the PR is done and can be merged.

@agoose77
Copy link
Collaborator Author

Very likely, this happened because the v1 C++ IndexedArray and IndexedOptionArray were a single class, and more work is needed for the IndexedOptionArray (the else clause of if (index[i] >= 0) in awkward_IndexedArray_reduce_next_64).

Yes, this was my assessment too. And, it would have been risky to try and simplify this at v1→v2 time; much safer to do this now that things are stable and working.

@agoose77
Copy link
Collaborator Author

Do you want to add kernel test samples? (Not all of the kernels have them.) Otherwise, the PR is done and can be merged.

My brain is starting to hurt from spending so much time on reducers - is this something you'd have the cycles for? If not, then let's merge this and I'll make a mental note to get to this down the line.

@jpivarski
Copy link
Member

I was saying that it's optional. So I'll merge it now (after tests).

Same for #1813.

@jpivarski jpivarski enabled auto-merge (squash) October 20, 2022 19:43
@jpivarski jpivarski merged commit fe2ff3f into main Oct 20, 2022
@jpivarski jpivarski deleted the agoose77/fix-proper-regulararray branch October 20, 2022 19:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-needs-backport This PR needs a counterpart to backport to older versions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unable to reduce regular dimension in mixed layout
2 participants