-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41098: [Python] Add copy keyword in Array.__array__ for numpy 2.0+ compatibility #41071
GH-41098: [Python] Add copy keyword in Array.__array__ for numpy 2.0+ compatibility #41071
Conversation
…y 2.0+ compatibility
python/pyarrow/array.pxi
Outdated
@@ -1543,7 +1543,8 @@ cdef class Array(_PandasConvertible): | |||
def _to_pandas(self, options, types_mapper=None, **kwargs): | |||
return _array_like_to_pandas(self, options, types_mapper=types_mapper) | |||
|
|||
def __array__(self, dtype=None): | |||
def __array__(self, dtype=None, copy=None): | |||
# TODO honor the copy keyword |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to open an issue for that
|
@github-actions crossbow submit test-conda-python-3.11-pandas-upstream_devel test-conda-python-3.10-pandas-nightly |
Revision: 686865e Submitted crossbow builds: ursacomputing/crossbow @ actions-3ef08a5f21
|
python/pyarrow/array.pxi
Outdated
@@ -1543,7 +1543,20 @@ cdef class Array(_PandasConvertible): | |||
def _to_pandas(self, options, types_mapper=None, **kwargs): | |||
return _array_like_to_pandas(self, options, types_mapper=types_mapper) | |||
|
|||
def __array__(self, dtype=None): | |||
def __array__(self, dtype=None, copy=None): | |||
# TODO honor the copy=True case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we raise an exception for now? Also, can you open a GH issue for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we raise an exception for now?
I wouldn't do that I think, because then when numpy would start passing that down in let's say numpy 2.1, np.array(obj)
would start erroring, while those never errored before (and for some cases this might incorrectly not return a copy (although still marked as read-only), many cases already do copy anyways)
Although to be honest, I don't really know what the strategy of numpy will be to enable this keyword. If they would just start to stop copying the result of __array__
, that would cause such changes in many libraries with __array__
s)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though ideally, we just directly implement copy=True
as well (it was just lest critical for numpy 2.0 as it is not yet used, and also a bit harder to implement)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and also a bit harder to implement
On second thought, it might be relatively easy to determine if to_numpy()
returned a copy or not? I was first thinking we would have to mimic the logic based on the type ("if primitive and no nulls, then it is zero copy"), but we might be able to check if the resulting ndarray has a base pointing to a pyarrow object?
Although the simple logic of numeric+no nulls might be easier in practice.
python/pyarrow/table.pxi
Outdated
@@ -525,7 +525,8 @@ cdef class ChunkedArray(_PandasConvertible): | |||
|
|||
return values | |||
|
|||
def __array__(self, dtype=None): | |||
def __array__(self, dtype=None, copy=None): | |||
# copy keyword can be ignored because to_numpy() already returns a copy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, copy=False
should then raise an error, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, that comment is from before I decided to handle the copy=False case for Array. Indeed we should just raise an error here that a no-copy is not possible.
Updated.
python/pyarrow/table.pxi
Outdated
@@ -1533,7 +1534,8 @@ cdef class _Tabular(_PandasConvertible): | |||
raise TypeError(f"Do not call {self.__class__.__name__}'s constructor directly, use " | |||
f"one of the `{self.__class__.__name__}.from_*` functions instead.") | |||
|
|||
def __array__(self, dtype=None): | |||
def __array__(self, dtype=None, copy=None): | |||
# copy keyword can be ignored as this always already returns a copy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here?
python/pyarrow/array.pxi
Outdated
values = self.to_numpy(zero_copy_only=False) | ||
if copy is True and is_primitive(self.type.id) and self.null_count == 0: | ||
# to_numpy did not yet make a copy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this make a second copy for e.g. boolean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, boolean is not considered as primitive (although you might expect that). So a boolean array takes the code path below just returning the numpy values
is no dtype was required, so no second copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, sorry, boolean is of course "primitive", what I meant to say is that it is not "numeric", and the code above was meant to use is_numeric
(I added that to libarrow.pxd for that purpose). Good catch ..
@github-actions crossbow submit test-conda-python-3.11-pandas-upstream_devel test-conda-python-3.10-pandas-nightly |
Revision: 4b630f6 Submitted crossbow builds: ursacomputing/crossbow @ actions-0fad67714d
|
(the failure is the flaky join_asof test) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jorisvandenbossche !
… compatibility (#41071) ### Rationale for this change Adapting for changes in numpy 2.0 as decribed at https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword and future changes to pass copy=True (numpy/numpy#26208) ### What changes are included in this PR? Add a `copy=None` to the signatures of our `__array__` methods. This does have impact on the user facing behaviour, though. Questioning that upstream at numpy/numpy#25941 (comment) ### Are these changes tested? Yes ### Are there any user-facing changes? No (compared to usage with numpy<2) * GitHub Issue: #39532 * GitHub Issue: #41098 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 80258fe. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…y 2.0+ compatibility (apache#41071) ### Rationale for this change Adapting for changes in numpy 2.0 as decribed at https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword and future changes to pass copy=True (numpy/numpy#26208) ### What changes are included in this PR? Add a `copy=None` to the signatures of our `__array__` methods. This does have impact on the user facing behaviour, though. Questioning that upstream at numpy/numpy#25941 (comment) ### Are these changes tested? Yes ### Are there any user-facing changes? No (compared to usage with numpy<2) * GitHub Issue: apache#39532 * GitHub Issue: apache#41098 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…y 2.0+ compatibility (apache#41071) ### Rationale for this change Adapting for changes in numpy 2.0 as decribed at https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword and future changes to pass copy=True (numpy/numpy#26208) ### What changes are included in this PR? Add a `copy=None` to the signatures of our `__array__` methods. This does have impact on the user facing behaviour, though. Questioning that upstream at numpy/numpy#25941 (comment) ### Are these changes tested? Yes ### Are there any user-facing changes? No (compared to usage with numpy<2) * GitHub Issue: apache#39532 * GitHub Issue: apache#41098 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
Adapting for changes in numpy 2.0 as decribed at https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword and future changes to pass copy=True (numpy/numpy#26208)
What changes are included in this PR?
Add a
copy=None
to the signatures of our__array__
methods.This does have impact on the user facing behaviour, though. Questioning that upstream at numpy/numpy#25941 (comment)
Are these changes tested?
Yes
Are there any user-facing changes?
No (compared to usage with numpy<2)
copy
keyword for numpy 2.0+ compatibility #41098