-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Best practices for "carrying through" type-specific operations to wrapped types #6
Comments
ContextI was working on a prototype for duck arrays supporting vector-valued data on bin edges for pydata/xarray#5775. This exercise escalated into a potential solution for this (#6) as well as the "layer removal" in #5. A solution that most likely does not scale is adding a Introducing
|
@SimonHeybrock Thanks for sharing your idea on this! In case it is helpful for anyone else, I needed to mock this up in order to work my head around it...some very rough code beneath the details tag: import numpy as np
class HighLevelWrappingArray:
def __init__(self, array, attrs=None):
"""An array with attrs dict and automatic exposure of these attrs through getattr.
A minimially viable mock of xarray DataArray for testing __array_property__.
"""
self._array = array
self._attrs = attrs if attrs is not None else {}
def __repr__(self):
return f"{self.__class__.__name__}(\n{self._array}\n{self._attrs}\n)"
def __array_property__(self, name, wrap):
if hasattr(self._array, '__array_property__'):
return self._array.__array_property__(
name, wrap=lambda x: wrap(self.__class__(x, self._attrs))
)
raise AttributeError(f"{self.__class__} has no array attribute '{name}'")
def __getattr__(self, name):
if name == "data":
return self._array
elif name in self._attrs:
return self._attrs[name]
else:
return self.__array_property__(name, wrap=lambda x: x)
class MidLevelWrappingArray:
def __init__(self, array, units=None):
"""An array with units attribute.
A minimially viable mock of pint Quantity for testing __array_property__.
"""
self._magnitude = array
self._units = units if units is not None else "dimensionless"
def __repr__(self):
return f"{self.__class__.__name__}(\n{self.magnitude}\n{self.units}\n)"
def __array_property__(self, name, wrap):
if name == "magnitude":
return wrap(self._magnitude)
if name == "units":
return self._units
if hasattr(self._magnitude, '__array_property__'):
return self._magnitude.__array_property__(
name, wrap=lambda x: wrap(self.__class__(x, self._units))
)
raise AttributeError(f"{self.__class__} has no array attribute '{name}'")
def __getattr__(self, name):
if name == "magnitude":
return self._magnitude
if name == "units":
return self._units
return self.__array_property__(name, wrap=lambda x: x)
class LowLevelWrappingArray:
def __init__(self, array, named_segments=None):
"""An array with special properties related to labeled segments.
A minimally viable mock of Dask Array for testing __array_property__.
"""
self._array = array
self._segments = named_segments if named_segments is not None else {}
def __repr__(self):
return f"{self.__class__.__name__}(\n{self._array}\n{self._segments}\n)"
def as_full_array(self):
return self._array
def get_segment(self, segment_name):
return self._array[self._segments[segment_name]]
def __array_property__(self, name, wrap):
if name == "get_segment":
return lambda x: wrap(self.get_segment(x))
if name == "as_full_array":
return lambda: wrap(self.as_full_array())
if hasattr(self._array, '__array_property__'):
return self._array.__array_property__(
name, wrap=lambda x: wrap(self.__class__(x))
)
raise AttributeError(f"{self.__class__} has no array attribute '{name}'")
def __getattr__(self, name):
return self.__array_property__(name, wrap=lambda x: x)
class NDArrayWithArrayProperty(np.lib.mixins.NDArrayOperatorsMixin):
__array_priority__ = 20
def __init__(self, array):
self._array = array
def __repr__(self):
return f"{self.__class__.__name__}(\n{self._array}\n)"
def __array__(self):
return np.asarray(self._array)
def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
if method == '__call__':
inputs = [arg._array if isinstance(arg, self.__class__) else arg
for arg in inputs]
return self.__class__(ufunc(*inputs, **kwargs), **self.attrs)
else:
return NotImplemented
def __getitem__(self, key):
return self._array[key]
def __array_property__(self, name, wrap):
if hasattr(self._array, name):
return getattr(self._array, name)
raise AttributeError(f"{self._array.__class__} has no array attribute '{name}'") There is a fair bit of complexity here, so my below points below may be the result of misinterpreting this concept. If so, please let me know where I misunderstood something! As written/proposed, I'd be concerned about its fragility (i.e., needing every array layer to implement the
That all being said, my hunch is that with type annotations, there could be enough information for standardized tooling to parse all this and set up the from wrapping_arrays import RegisterWrappingArrayType, get_wrapped_property
@RegisterWrappingArrayType
class HighLevelWrappingArray:
...
def __getattr__(self, name: str) -> Any:
# any built-in __getattr__
return get_wrapped_property(self._rewrap_contents, self._wrapped_array, name) A nice (potential) benefit of not enforcing a protocol all the way to the bottom of the type casting hierarchy in order to make this work out is that only arrays that wrap other arrays would need to implement this (e.g., xarray, pint, dask; not cupy, numpy, etc.). Instead, at the lowest level (e.g., when a wrapped type shows up that is not registered as a wrapping type), the property behavior can simply be determined by the Array API. Also, given that a "Duck Array DAG Library" was already proposed in #3, these property utilities would have a natural home there. |
I absolutely share your concern about complexity, but doesn't this decoupled approach actually reduce the overall complexity? Individual packages need to know less about each other, and the need for helper packages like More specifically, implementing
Not sure I understand this one. Several comments:
For example, for
I am not sure I understand your suggestion. It sounds like just forwarding to any attribute found on lower levels (plus the rewrap)? How does this address the issue of drowning in a sea of attributes?
Neither does my suggestion. There are no changes/patches to NumPy in my prototype. |
Thanks for the thorough reply! Unfortunately, though, I think we both have not quite been understanding each others' points. My apologies if any of that was due to a lack of clarity and/or diligence on my part; I'll try doing better below. Hopefully the back-and-forth on these points of discussion will help us work towards a better and more mutually understood solution!
My latter idea about how to implement this "carrying" through of properties should be decoupled (from the viewpoint of direct array library to array library interactions) in the same fashion as the former protocol. It is of course still coupled to a common library, but that should both be lightweight from an implementation perspective and most likely already existing in some extent due to #3. Also, I was originally referring to complexity of code that needs to exist within any given array wrapping library, or equivalently, what we would need to instruct libraries to implement to support this. Overall, yes, having a separate package encapsulate these details would be more complex, but that complexity would exist within a single shared library, which I would believe would lead to less points of failure.
Yes, either approach suggested so far (and indeed, any full solution to this GitHub issue) should reduce the need for such helper packages, as well as support things that take way too much manual unwrapping and rewrapping at the moment.
Unfortunately, there is no such thing as the top level in this context; any array type that wraps other arrays could end up as outermost layer if the higher-level types are not in use. So, for example, we need to fully support situations of pint Quantities wrapping, say, Sparse arrays just the same as xarray DataArrays wrapping pint Quantities wrapping anything else. So, if attribute access is the way forward for this, then we'd need to work through
Yes indeed! In a partial API case, that kind of decorator was exactly what I was suggesting. Though, depending on how much information could be parsed from the property signature / type annotation, there may need to be several options of such decorators, since some properties will need to be rewrapped, some exposed as-is, and others handled as callables (with or without rewrapping the output, rather than the callable itself).
As far as I understood the
I think this was the key misunderstanding I had about your initial idea and likely the fulcrum of our diverging perspectives on this topic. I also apologize for my exaggerated and poor phrasing of "essentially their entire API"...in retrospect I should have said something more like "a significant portion of their API", as many portions of the API are of course either not relevant to be user-exposed in this way (e.g., format conversions, alternate constructors) or consist of operations handled through other means (e.g., math operations). I would still disagree though with severely limiting the properties exposed through this mechanism. When using multiply nested arrays, my use cases may demand any number of different aspects of the APIs of each array layer. So, if only a small portion of the API were to be carried through to higher level types, then I am no better off than I am at present (needing to manually unwrap and rewrap). To borrow your example, say So, given that the breath of user expectations likely demands that substantial portions of every layer's API is exposed, but (as you said) also having the need to avoid a mess of properties on higher levels, I think our discussion back-and-forth has revealed the need for some kind of namespace to be a core consideration. How to name these "wrapped array property namespaces" and how they relate to/differ from something like xarray's accessors (or even if a separate, rather than attached method/property chaining, user interface needs to be considered) would definitely need to be further thought out.
This would indeed forward to any registered attribute found on lower levels plus conditionally (based on what kind of attribute) rewrapping. Given that such collections of registered attributes would likely still need to be large (see prior point), this indeed does not address the issue of drowning in a sea of attributes, so we'd need something like namespaces to avoid that.
I think I'm missing something here...as alluded to previously, how are potentially low-level attributes (like |
Maybe this is the key misunderstanding in our discussion: I had considered these properties as something that most array implementations provide anyway? If we assume that array implementers stick close to the Python Array API there is no need to handle any of those properties with the proposed mechanism. |
Yes, that is another key misunderstanding, thank you for clarifying! Coming from a pint-informed mindset where Array API-type things are implemented through this kind of attribute deferral, it had not occurred to me that Array API details should be handled separately within in array wrapping types rather than deferring. However, declaring that "wrapping array implementors should ensure they implement the Array API apart from this attribute deferral protocol" seems like a reasonable and clean-cut stance to take! |
Indeed, I am still struggling with that one. One option I am considering is providing an accessor, but not with Xarray's mechanism (which does not work with wrapping), but via da.pint.to('mm') # `pint` property found via `__array_property__` The helper object returned by the
Your suggestion sounds similar to the |
I slightly extended my prototype to support dask: array = dask.array.arange(15).reshape(3, 5)
vectors = sx.VectorArray(array, ['vx', 'vy', 'vz'])
edges = sx.BinEdgeArray(vectors)
data = Quantity(edges, 'meter/second')
masked = sx.MultiMaskArray(data,
masks={'mask1': np.array([False, False, True, False])})
da = xr.DataArray(dims=('x', ), data=masked, coords={'x': np.arange(4)})
result = da.xcompute() # named xcompute here to avoid calling xarray.compute The thing to note here is that |
@SimonHeybrock Thanks for the follow-ups! In the coming days, I'll be writing up a thorough post for the Scientific Python discourse on the topics in this repo, and I plan on using what you have here for |
Thanks. I'll keep looking into this as well. I think am finding more problems, such as NEP-35 (which says it fixes shortcomings of NEP-18) still having shortcomings (or maybe just its implementation?): The |
In many use cases, we can have "multiply nested" duck arrays rather than just two interacting duck array types. This creates a couple issues:
Note: this issue is mostly relevant to array types higher on DAG, as they are the ones handling the respective interactions.
Current State
Many of the libraries higher on the DAG (e.g., xarray and pint) have some form of attribute fallback to their wrapped data. However, more complex methods/operations often require direct handling (such as rechunking an underlying dask array contained in an xarray data structure) that each library has to implement itself. This approach quickly becomes impractical as many array types become involved (e.g., how can xarray do all of pint, dask, and CuPy operations on a
xarray(pint(dask(cupy)))
nested array?), especially if there is not a standard way of carrying through these operations "down the stack."Specific Goals
xarray(pint(dask(numpy)))
nested array, or sparse-specific things on apint(dask(sparse))
nested arraySuggested Paths Forward
Discussion
The text was updated successfully, but these errors were encountered: