Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: to_list must follow __getitem__ implementations, even in Records #1652

Merged
merged 7 commits into from
Sep 1, 2022

Conversation

jpivarski
Copy link
Member

The to_list implementation has to behave like a recursive iteration through an ak.Array/ak.Record, replacing all lists with Python list, all records with Python dict, and all missing values with None. Iteration means calling __getitem__ for each index, so if __getitem__ is overloaded (though ak.behavior), we should see the consequence of that in the to_list output.

The v1 to_list was implemented this way, but with the un-overloaded __getitem__, that results in a lot of indirection and type-checking that can be avoided, because we know the type of our array. A vectorized implementation, optimized for un-overloaded __getitem__, was first implemented in ee0c2cb, which acted in vectorized strides over each node in the layout tree. Then #1418 fixed the case of small slices of a large array (don't vectorize over the whole thing, just the part you see).

The explicit check for cls.__getitem__ is ak._v2.highlevel.Array.__getitem__ is correct: that part of the code wants to know if we are in the optimized, un-overloaded case, or something else. If something else, we do what would be the straightforward implementation of to_list: calling __getitem__ on every item. This allows the overloaded __getitem__ to do whatever it wants, returning any type that it wants, and to_list behaves as it should.

What was missing in #1600 was (a) an equivalent implementation for Record, and (b) continuing to recurse to_list over the output of __getitem__. This PR fixes those two things, with special care in ak._v2.record.Record to prevent infinite recursion.

@agoose77
Copy link
Collaborator

I might be misreading the UI, but in case my understanding is right, we don't want to close #1600 with this PR unless we create a new issue to track the performance implication - I think vector's use case is legitimate enough that we don't want to hurt to_list speed silently.

@codecov
Copy link

codecov bot commented Aug 30, 2022

Codecov Report

Merging #1652 (079951d) into main (9e17f29) will increase coverage by 0.52%.
The diff coverage is 64.75%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/_v2/_connect/cuda/__init__.py 0.00% <0.00%> (ø)
src/awkward/_v2/_connect/numexpr.py 88.40% <0.00%> (ø)
src/awkward/_v2/_connect/pyarrow.py 88.46% <0.00%> (ø)
...awkward/_v2/_connect/rdataframe/from_rdataframe.py 0.00% <0.00%> (ø)
...c/awkward/_v2/_connect/rdataframe/to_rdataframe.py 0.00% <0.00%> (ø)
src/awkward/_v2/_lookup.py 98.68% <ø> (+1.17%) ⬆️
src/awkward/_v2/numba.py 93.47% <0.00%> (ø)
src/awkward/_v2/operations/ak_from_avro_file.py 66.66% <0.00%> (ø)
src/awkward/_v2/operations/ak_local_index.py 100.00% <ø> (ø)
src/awkward/_v2/operations/ak_max.py 87.09% <ø> (ø)
... and 65 more

@jpivarski
Copy link
Member Author

jpivarski commented Aug 30, 2022

I haven't had a chance to address all of your comments yet—sorry!

But, coming to that, I'm thinking of issue #1600 as an issue of consistency in output, the fact that to_list returned the wrong results, not a performance issue.

So now let's consider the performance implications. The simplest implementation to get the right results is to call __getitem__ on every element, recursively, and wrapping ListType in Python list, etc. The vectorized implementation _assumes_ it knows what getitem` will return and uses that knowledge to be 10× faster. If it can't use the fast path, it defers to the simple implementation, so there's no loss, just no gain.

You're asking if we can do any better, by recognizing that Vector's __getitem__ is

def __getitem__(self, where):
    return super().__getitem__(where)

and using the fast-path on it, too. We would definitely be getting too intrusive if we tried to examine the function to determine that it is inert. Presumably, there are other ways of writing this function that are equally inert—looking at version-dependent bytecode or using the inspect module to re-compile it into an AST is way too much of a rabbit hole.

Your specific suggestion was to allow ak.Array/ak.Record subclasses to override their to_list as well as their __getitem__. Overriding __getitem__ by itself might be a performance bomb, but a knowledgeable downstream developer would override both and get the performance back.

Let's see what the damage is: failing to take the fast-path should put the performance back at the level where it was in v1.

import awkward as ak
file = "https://pivarski-princeton.s3.amazonaws.com/chep-2021-jagged-jagged-jagged/zlib9-jagged3.parquet"
array_v1 = ak.from_parquet(file)
array_v2 = ak._v2.from_parquet(file)

%%timeit
ak.to_list(array_v1[:10000])
# 22 seconds

%%timeit
ak._v2.to_list(array_v2[:10000])
# 0.7 seconds

Overloading __getitem__ on a v2 array,

class Overload(ak._v2.Array):
    def __getitem__(self, where):
        return super().__getitem__(where)

ak._v2.behavior["overload"] = Overload

overloaded = ak._v2.with_parameter(array_v2.layout, "__array__", "overload")

Crud, that's only the outermost layer...

array_v2.layout.content.content.content.parameters["__array__"] = "overload"
array_v2.layout.content.content.parameters["__array__"] = "overload"
array_v2.layout.content.parameters["__array__"] = "overload"
array_v2.layout.parameters["__array__"] = "overload"
array_v2 = ak._v2.Array(array_v2)

(Okay! Okay! I'm using it as mutable.)

%%timeit
ak._v2.to_list(array_v2[:10000])
# 677 seconds

So, that is a performance bomb: v2 went from being 30× faster than v1 to being 30× slower. (Interesting symmetry, but only approximate.) I looked into it with cProfiler, and most of the slowness is in setting up asynchronous-ready error-handling (so that we can support GPUs and other asynchronous computation: #1321). In other words, the problem is that highlevel __getitem__ itself is more expensive than in v1, and whenever anybody calls it millions of times, they'll notice the lag. Overload.__getitem__ explicitly calls it, and since we aren't allowed to guess at what this function does, we have to call it millions of times.

Anyway, we do need to have some sort of backdoor for downstream developers to avoid this catastrophe. Letting them override to_list opens a can of worms because we would need to make an API for it, which would require hand-offs between us and the downstream developers at each level of list-node. Since this is going to require insider knowledge anyway, could we just give it an "ignore me" flag?

class Overload(ak._v2.Array):
    def __getitem__(self, where):
        return super().__getitem__(where)

    __getitem__.ignore_in_to_list = True

ak._v2.behavior["overload"] = Overload

The previous commit (f8e0dae) implements that flag, and it puts the overloaded array's to_list back at 0.7 seconds. Of course, this is also an API to maintain forever. Maybe we should get systematic about how such things are to be expressed? (They'll have to be explained in documentation, too.) The overloading capabilities themselves are systematically expressed in the ak.behavior dict, but this seems to me like a different thing, a performance hint instead of a behavior-changing thing.

Let me know in a review if you think this is the right way to go, and take over this PR/add an alternative if you think that's a better idea. This is one that you've thought about a lot.


return self._array[self._at : self._at + 1]._to_list(behavior, None)[0]
def _to_list(self, behavior, json_conversions):
Copy link
Collaborator

@agoose77 agoose77 Aug 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the challenge here is working out what to_list is supposed to allow. I had originally been thinking that the slow case allows users to return non-awkward types, which would present a problem when using from_iter to re-load the same data (e.g. if it returns a random Python object). However, whilst it looks like we currently do permit that, I don't know whether we actively intend to support it. It might just be a side effect of allowing to_list to return something that is different to the default to_list, e.g. a filtered view of the data.

You seem to acknowledge this question in

this seems to me like a different thing, a performance hint instead of a behavior-changing thing.

So I'd be interested to hear your take.

I don't know if I've already said this, but I'm wondering whether we need to do any of this - why not leave overriding __getitem__ unsupported (given that strings are already handled internally)? We could look to adding support down the road if someone needs it. Whereas, if we keep this code in-place, then we are accepting a burden of supporting it for the indefinite future.

So, take my opinion as "can we just remove this code?", but I'm on-board with adding a performance flag rather than full to_list overload if removal is off the table :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, take my opinion as "can we just remove this code?"

Well, the first thing is that we can't prevent people from overriding __getitem__, and ignoring the fact that they might have overridden it would make the output of to_list wrong in a way that exposes its implementation. I can't see how we can not do this.

If someone naively overrides __getitem__ and to_list returns the right output but is slower, that's not an error. It might be undesirable ("might" depends on whether the user even cares about to_list, as that's a way of exiting Awkward's system and they might not plan to do that), but now we have a sneaky way to patch it, if the downstream developer can tell us that __getitem__'s behavior is not changed.

I think the challenge here is working out what to_list is supposed to allow.

The to_list function converts Awkward data into Python data, as though iterating recursively, turning every Awkward list into a Python list and every Awkward record into a Python dict (and missing values into None). This is a naive implementation of to_list:

>>> def to_list(array):
...     if isinstance(array, ak._v2.Array):
...         return [to_list(array[index]) for index in range(len(array))]
...     elif isinstance(array, ak._v2.Record):
...         return {field: to_list(array[field]) for field in array.fields}
...     else:
...         return array

Leaf-node values, the NumpyArrays, return what NumPy returns for its tolist. For example, if the NumPy types are dates, NumPy returns datetime.datetime objects:

>>> array = ak._v2.Array(np.array([0, 1, 2], "datetime64[Y]"))
>>> array
<Array [1970, 1971, 1972] type='3 * datetime64[Y]'>
>>> array.tolist()
[datetime.date(1970, 1, 1), datetime.date(1971, 1, 1), datetime.date(1972, 1, 1)]

which are, in my opinion, random Python objects. The IP address case we talked about is another one in which someone would want to override __getitem__:

>>> import ipaddress
>>> class IPv4Array(ak._v2.Array):
...     def __getitem__(self, where):
...         out = super().__getitem__(where)
...         return ipaddress.ip_address(np.asarray(out).tobytes())
... 
>>> ak._v2.behavior["IPv4"] = IPv4Array
>>> array = ak._v2.with_parameter(
...     ak._v2.without_parameters(
...         ak._v2.Array([b"\x7f\x00\x00\x01", b"\xff\xff\xff\xff"])
...     ),
...     "__array__",
...     "IPv4",
... )
>>> array
<IPv4Array [[127, 0, 0, 1], [255, ..., 255]] type='2 * [var * uint8, paramete...'>
>>> array[0]
IPv4Address('127.0.0.1')
>>> array[1]
IPv4Address('255.255.255.255')
>>> array.tolist()
[IPv4Address('127.0.0.1'), IPv4Address('255.255.255.255')]

Here, instead of going to Python date objects, or Python strings as in the case of __array__ = "string", they're going into IPv4Address objects. That's what to_list is supposed to do: give you Python objects.

would present a problem when using from_iter to re-load the same data

This is not round-trippable with from_iter, but it never was: to_list is a lossy operation that loses the distinction between fixed-length lists and variable-length lists, as well as the types of any list contents without values. This is also true of NumPy's tolist and array constructor. These functions provide ways of getting data between vectorized form and Python form, but these forms are not isomorphic to each other.

@jpivarski
Copy link
Member Author

Since this was mentioned in @Saransh-cpp's talk as (I think) the only unresolved Awkward issue in updating Vector to Awkward v2, I think it would be best to merge it. The two possible actions (that I can think of) are to merge it as-is, allowing overrides of __getitem__ to affect to_list behavior, with a decorator performance hint when the to_list behavior is not supposed to change, or to strip it down such that overrides of __getitem__ don't affect to_list behavior at all.

@agoose77
Copy link
Collaborator

agoose77 commented Sep 1, 2022

@jpivarski for completeness:

Well, the first thing is that we can't prevent people from overriding getitem, and ignoring the fact that they might have overridden it would make the output of to_list wrong in a way that exposes its implementation. I can't see how we can not do this.

I would argue that overwriting __getitem__ is wrong, so users shouldn't do that! I can't immediately think of a use case given that any __getitem__ change wouldn't modify how Awkward interacts with the array under the hood.

With your ipaddress example (nice example, btw), I'd advocate for adding to_ip_address to the behavior, which explicitly produces an IPv4Address object.

That's where I leave it, though - I'm happy with this alternative solution if you want to proceed :)

@jpivarski jpivarski merged commit 15318e1 into main Sep 1, 2022
@jpivarski jpivarski deleted the jpivarski/make-to_list-safe-for-overriding-getitem branch September 1, 2022 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consistency between __getitem__ and to_list when a Record is the top-level
2 participants