-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: to_list must follow __getitem__ implementations, even in Records #1652
fix: to_list must follow __getitem__ implementations, even in Records #1652
Conversation
I might be misreading the UI, but in case my understanding is right, we don't want to close #1600 with this PR unless we create a new issue to track the performance implication - I think |
Codecov Report
Additional details and impacted files
|
I haven't had a chance to address all of your comments yet—sorry! But, coming to that, I'm thinking of issue #1600 as an issue of consistency in output, the fact that So now let's consider the performance implications. The simplest implementation to get the right results is to call You're asking if we can do any better, by recognizing that Vector's def __getitem__(self, where):
return super().__getitem__(where) and using the fast-path on it, too. We would definitely be getting too intrusive if we tried to examine the function to determine that it is inert. Presumably, there are other ways of writing this function that are equally inert—looking at version-dependent bytecode or using the Your specific suggestion was to allow ak.Array/ak.Record subclasses to override their Let's see what the damage is: failing to take the fast-path should put the performance back at the level where it was in v1. import awkward as ak
file = "https://pivarski-princeton.s3.amazonaws.com/chep-2021-jagged-jagged-jagged/zlib9-jagged3.parquet"
array_v1 = ak.from_parquet(file)
array_v2 = ak._v2.from_parquet(file)
%%timeit
ak.to_list(array_v1[:10000])
# 22 seconds
%%timeit
ak._v2.to_list(array_v2[:10000])
# 0.7 seconds Overloading class Overload(ak._v2.Array):
def __getitem__(self, where):
return super().__getitem__(where)
ak._v2.behavior["overload"] = Overload
overloaded = ak._v2.with_parameter(array_v2.layout, "__array__", "overload") Crud, that's only the outermost layer... array_v2.layout.content.content.content.parameters["__array__"] = "overload"
array_v2.layout.content.content.parameters["__array__"] = "overload"
array_v2.layout.content.parameters["__array__"] = "overload"
array_v2.layout.parameters["__array__"] = "overload"
array_v2 = ak._v2.Array(array_v2) (Okay! Okay! I'm using it as mutable.) %%timeit
ak._v2.to_list(array_v2[:10000])
# 677 seconds So, that is a performance bomb: v2 went from being 30× faster than v1 to being 30× slower. (Interesting symmetry, but only approximate.) I looked into it with cProfiler, and most of the slowness is in setting up asynchronous-ready error-handling (so that we can support GPUs and other asynchronous computation: #1321). In other words, the problem is that highlevel Anyway, we do need to have some sort of backdoor for downstream developers to avoid this catastrophe. Letting them override class Overload(ak._v2.Array):
def __getitem__(self, where):
return super().__getitem__(where)
__getitem__.ignore_in_to_list = True
ak._v2.behavior["overload"] = Overload The previous commit (f8e0dae) implements that flag, and it puts the overloaded array's Let me know in a review if you think this is the right way to go, and take over this PR/add an alternative if you think that's a better idea. This is one that you've thought about a lot. |
|
||
return self._array[self._at : self._at + 1]._to_list(behavior, None)[0] | ||
def _to_list(self, behavior, json_conversions): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the challenge here is working out what to_list
is supposed to allow. I had originally been thinking that the slow case allows users to return non-awkward types, which would present a problem when using from_iter
to re-load the same data (e.g. if it returns a random Python object). However, whilst it looks like we currently do permit that, I don't know whether we actively intend to support it. It might just be a side effect of allowing to_list
to return something that is different to the default to_list
, e.g. a filtered view of the data.
You seem to acknowledge this question in
this seems to me like a different thing, a performance hint instead of a behavior-changing thing.
So I'd be interested to hear your take.
I don't know if I've already said this, but I'm wondering whether we need to do any of this - why not leave overriding __getitem__
unsupported (given that strings are already handled internally)? We could look to adding support down the road if someone needs it. Whereas, if we keep this code in-place, then we are accepting a burden of supporting it for the indefinite future.
So, take my opinion as "can we just remove this code?", but I'm on-board with adding a performance flag rather than full to_list
overload if removal is off the table :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, take my opinion as "can we just remove this code?"
Well, the first thing is that we can't prevent people from overriding __getitem__
, and ignoring the fact that they might have overridden it would make the output of to_list
wrong in a way that exposes its implementation. I can't see how we can not do this.
If someone naively overrides __getitem__
and to_list
returns the right output but is slower, that's not an error. It might be undesirable ("might" depends on whether the user even cares about to_list
, as that's a way of exiting Awkward's system and they might not plan to do that), but now we have a sneaky way to patch it, if the downstream developer can tell us that __getitem__
's behavior is not changed.
I think the challenge here is working out what
to_list
is supposed to allow.
The to_list
function converts Awkward data into Python data, as though iterating recursively, turning every Awkward list into a Python list
and every Awkward record into a Python dict
(and missing values into None
). This is a naive implementation of to_list
:
>>> def to_list(array):
... if isinstance(array, ak._v2.Array):
... return [to_list(array[index]) for index in range(len(array))]
... elif isinstance(array, ak._v2.Record):
... return {field: to_list(array[field]) for field in array.fields}
... else:
... return array
Leaf-node values, the NumpyArrays, return what NumPy returns for its tolist
. For example, if the NumPy types are dates, NumPy returns datetime.datetime
objects:
>>> array = ak._v2.Array(np.array([0, 1, 2], "datetime64[Y]"))
>>> array
<Array [1970, 1971, 1972] type='3 * datetime64[Y]'>
>>> array.tolist()
[datetime.date(1970, 1, 1), datetime.date(1971, 1, 1), datetime.date(1972, 1, 1)]
which are, in my opinion, random Python objects. The IP address case we talked about is another one in which someone would want to override __getitem__
:
>>> import ipaddress
>>> class IPv4Array(ak._v2.Array):
... def __getitem__(self, where):
... out = super().__getitem__(where)
... return ipaddress.ip_address(np.asarray(out).tobytes())
...
>>> ak._v2.behavior["IPv4"] = IPv4Array
>>> array = ak._v2.with_parameter(
... ak._v2.without_parameters(
... ak._v2.Array([b"\x7f\x00\x00\x01", b"\xff\xff\xff\xff"])
... ),
... "__array__",
... "IPv4",
... )
>>> array
<IPv4Array [[127, 0, 0, 1], [255, ..., 255]] type='2 * [var * uint8, paramete...'>
>>> array[0]
IPv4Address('127.0.0.1')
>>> array[1]
IPv4Address('255.255.255.255')
>>> array.tolist()
[IPv4Address('127.0.0.1'), IPv4Address('255.255.255.255')]
Here, instead of going to Python date objects, or Python strings as in the case of __array__ = "string"
, they're going into IPv4Address
objects. That's what to_list
is supposed to do: give you Python objects.
would present a problem when using
from_iter
to re-load the same data
This is not round-trippable with from_iter
, but it never was: to_list
is a lossy operation that loses the distinction between fixed-length lists and variable-length lists, as well as the types of any list contents without values. This is also true of NumPy's tolist
and array
constructor. These functions provide ways of getting data between vectorized form and Python form, but these forms are not isomorphic to each other.
Since this was mentioned in @Saransh-cpp's talk as (I think) the only unresolved Awkward issue in updating Vector to Awkward v2, I think it would be best to merge it. The two possible actions (that I can think of) are to merge it as-is, allowing overrides of |
@jpivarski for completeness:
I would argue that overwriting With your ipaddress example (nice example, btw), I'd advocate for adding That's where I leave it, though - I'm happy with this alternative solution if you want to proceed :) |
The
to_list
implementation has to behave like a recursive iteration through an ak.Array/ak.Record, replacing all lists with Pythonlist
, all records with Pythondict
, and all missing values withNone
. Iteration means calling__getitem__
for each index, so if__getitem__
is overloaded (thoughak.behavior
), we should see the consequence of that in theto_list
output.The v1
to_list
was implemented this way, but with the un-overloaded__getitem__
, that results in a lot of indirection and type-checking that can be avoided, because we know the type of our array. A vectorized implementation, optimized for un-overloaded__getitem__
, was first implemented in ee0c2cb, which acted in vectorized strides over each node in the layout tree. Then #1418 fixed the case of small slices of a large array (don't vectorize over the whole thing, just the part you see).The explicit check for
cls.__getitem__ is ak._v2.highlevel.Array.__getitem__
is correct: that part of the code wants to know if we are in the optimized, un-overloaded case, or something else. If something else, we do what would be the straightforward implementation ofto_list
: calling__getitem__
on every item. This allows the overloaded__getitem__
to do whatever it wants, returning any type that it wants, andto_list
behaves as it should.What was missing in #1600 was (a) an equivalent implementation for
Record
, and (b) continuing to recurseto_list
over the output of__getitem__
. This PR fixes those two things, with special care inak._v2.record.Record
to prevent infinite recursion.