Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41098: [Python] Add copy keyword in Array.__array__ for numpy 2.0+ compatibility #41071

Merged
merged 8 commits into from
Apr 15, 2024

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Apr 8, 2024

Rationale for this change

Adapting for changes in numpy 2.0 as decribed at https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword and future changes to pass copy=True (numpy/numpy#26208)

What changes are included in this PR?

Add a copy=None to the signatures of our __array__ methods.

This does have impact on the user facing behaviour, though. Questioning that upstream at numpy/numpy#25941 (comment)

Are these changes tested?

Yes

Are there any user-facing changes?

No (compared to usage with numpy<2)

@@ -1543,7 +1543,8 @@ cdef class Array(_PandasConvertible):
def _to_pandas(self, options, types_mapper=None, **kwargs):
return _array_like_to_pandas(self, options, types_mapper=types_mapper)

def __array__(self, dtype=None):
def __array__(self, dtype=None, copy=None):
# TODO honor the copy keyword
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to open an issue for that

Copy link

github-actions bot commented Apr 8, 2024

⚠️ GitHub issue #39532 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added Component: Python awaiting committer review Awaiting committer review awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Apr 8, 2024
@jorisvandenbossche jorisvandenbossche changed the title GH-39532: [Python] Add copy keyword in Array.__array__ for numpy 2.0+ compatibility GH-41098: [Python] Add copy keyword in Array.__array__ for numpy 2.0+ compatibility Apr 9, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 9, 2024
@jorisvandenbossche
Copy link
Member Author

@github-actions crossbow submit test-conda-python-3.11-pandas-upstream_devel test-conda-python-3.10-pandas-nightly

Copy link

github-actions bot commented Apr 9, 2024

Revision: 686865e

Submitted crossbow builds: ursacomputing/crossbow @ actions-3ef08a5f21

Task Status
test-conda-python-3.10-pandas-nightly GitHub Actions
test-conda-python-3.11-pandas-upstream_devel GitHub Actions

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review April 9, 2024 12:42
@@ -1543,7 +1543,20 @@ cdef class Array(_PandasConvertible):
def _to_pandas(self, options, types_mapper=None, **kwargs):
return _array_like_to_pandas(self, options, types_mapper=types_mapper)

def __array__(self, dtype=None):
def __array__(self, dtype=None, copy=None):
# TODO honor the copy=True case
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we raise an exception for now? Also, can you open a GH issue for it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we raise an exception for now?

I wouldn't do that I think, because then when numpy would start passing that down in let's say numpy 2.1, np.array(obj) would start erroring, while those never errored before (and for some cases this might incorrectly not return a copy (although still marked as read-only), many cases already do copy anyways)

Although to be honest, I don't really know what the strategy of numpy will be to enable this keyword. If they would just start to stop copying the result of __array__, that would cause such changes in many libraries with __array__s)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though ideally, we just directly implement copy=True as well (it was just lest critical for numpy 2.0 as it is not yet used, and also a bit harder to implement)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and also a bit harder to implement

On second thought, it might be relatively easy to determine if to_numpy() returned a copy or not? I was first thinking we would have to mimic the logic based on the type ("if primitive and no nulls, then it is zero copy"), but we might be able to check if the resulting ndarray has a base pointing to a pyarrow object?

Although the simple logic of numeric+no nulls might be easier in practice.

@@ -525,7 +525,8 @@ cdef class ChunkedArray(_PandasConvertible):

return values

def __array__(self, dtype=None):
def __array__(self, dtype=None, copy=None):
# copy keyword can be ignored because to_numpy() already returns a copy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, copy=False should then raise an error, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, that comment is from before I decided to handle the copy=False case for Array. Indeed we should just raise an error here that a no-copy is not possible.

Updated.

@@ -1533,7 +1534,8 @@ cdef class _Tabular(_PandasConvertible):
raise TypeError(f"Do not call {self.__class__.__name__}'s constructor directly, use "
f"one of the `{self.__class__.__name__}.from_*` functions instead.")

def __array__(self, dtype=None):
def __array__(self, dtype=None, copy=None):
# copy keyword can be ignored as this always already returns a copy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Apr 9, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Apr 9, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 11, 2024
values = self.to_numpy(zero_copy_only=False)
if copy is True and is_primitive(self.type.id) and self.null_count == 0:
# to_numpy did not yet make a copy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this make a second copy for e.g. boolean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, boolean is not considered as primitive (although you might expect that). So a boolean array takes the code path below just returning the numpy values is no dtype was required, so no second copy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, sorry, boolean is of course "primitive", what I meant to say is that it is not "numeric", and the code above was meant to use is_numeric (I added that to libarrow.pxd for that purpose). Good catch ..

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Apr 15, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 15, 2024
@jorisvandenbossche
Copy link
Member Author

@github-actions crossbow submit test-conda-python-3.11-pandas-upstream_devel test-conda-python-3.10-pandas-nightly

Copy link

Revision: 4b630f6

Submitted crossbow builds: ursacomputing/crossbow @ actions-0fad67714d

Task Status
test-conda-python-3.10-pandas-nightly GitHub Actions
test-conda-python-3.11-pandas-upstream_devel GitHub Actions

@jorisvandenbossche
Copy link
Member Author

(the failure is the flaky join_asof test)

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou pitrou merged commit 80258fe into apache:main Apr 15, 2024
12 checks passed
@pitrou pitrou removed the awaiting change review Awaiting change review label Apr 15, 2024
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Apr 15, 2024
raulcd pushed a commit that referenced this pull request Apr 15, 2024
… compatibility (#41071)

### Rationale for this change

Adapting for changes in numpy 2.0 as decribed at https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword and future changes to pass copy=True (numpy/numpy#26208)

### What changes are included in this PR?

Add a `copy=None` to the signatures of our `__array__` methods.

This does have impact on the user facing behaviour, though. Questioning that upstream at numpy/numpy#25941 (comment)

### Are these changes tested?

Yes

### Are there any user-facing changes?

No (compared to usage with numpy<2)
* GitHub Issue: #39532
* GitHub Issue: #41098

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 80258fe.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.

tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
…y 2.0+ compatibility (apache#41071)

### Rationale for this change

Adapting for changes in numpy 2.0 as decribed at https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword and future changes to pass copy=True (numpy/numpy#26208)

### What changes are included in this PR?

Add a `copy=None` to the signatures of our `__array__` methods.

This does have impact on the user facing behaviour, though. Questioning that upstream at numpy/numpy#25941 (comment)

### Are these changes tested?

Yes

### Are there any user-facing changes?

No (compared to usage with numpy<2)
* GitHub Issue: apache#39532
* GitHub Issue: apache#41098

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…y 2.0+ compatibility (apache#41071)

### Rationale for this change

Adapting for changes in numpy 2.0 as decribed at https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword and future changes to pass copy=True (numpy/numpy#26208)

### What changes are included in this PR?

Add a `copy=None` to the signatures of our `__array__` methods.

This does have impact on the user facing behaviour, though. Questioning that upstream at numpy/numpy#25941 (comment)

### Are these changes tested?

Yes

### Are there any user-facing changes?

No (compared to usage with numpy<2)
* GitHub Issue: apache#39532
* GitHub Issue: apache#41098

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants