[Datasets] For Arrow 8+, tensor column element access returns an ExtensionScalar. #29816

Closed
clarkzinzow opened this issue Oct 28, 2022 · 0 comments · Fixed by #29999
Labels: bug, data, P1

Comments

@clarkzinzow
Contributor

In Arrow 8+, accessing an element of a tensor column returns an ExtensionScalar rather than the plain __getitem__ result. We need to convert this scalar to a NumPy ndarray before returning it to the user.

For Arrow 9+, ExtensionScalar is subclassable, allowing us to return an ndarray view from .as_py().

For Arrow 8.*, ExtensionScalar is not subclassable, and we need to manually convert it to an ndarray.
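
The version split above can be sketched as a small dispatch helper. This is only an illustration of the three cases, not the code from the fix: the function name and the strategy labels are hypothetical, and the version string is assumed to look like pyarrow's `__version__` (for example "9.0.0" or a nightly such as "11.0.0.dev123").

```python
def tensor_element_strategy(arrow_version: str) -> str:
    """Pick how a tensor column element should be turned into an ndarray.

    Hypothetical helper: the returned labels are illustrative names only.
    """
    # Only the major version matters for this decision.
    major = int(arrow_version.split(".")[0])
    if major >= 9:
        # ExtensionScalar is subclassable: a subclass can return an
        # ndarray view from .as_py().
        return "scalar_subclass"
    if major == 8:
        # ExtensionScalar exists but cannot be subclassed: convert manually.
        return "manual_conversion"
    # Before Arrow 8, element access does not produce an ExtensionScalar.
    return "passthrough"
```

In practice the argument would come from something like `pyarrow.__version__`; a stricter parser may be preferable if pre-release version strings need special handling.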

clarkzinzow added the bug, P1, air, and data labels on Oct 28, 2022
clarkzinzow added this to the Arrow 7+ Support milestone on Oct 28, 2022
clarkzinzow self-assigned this on Oct 28, 2022
clarkzinzow added a commit that referenced this issue Nov 9, 2022
…nd nightly. (#29999)

This PR adds support for Arrow 8, 9, 10, and nightly in Ray. It is the third PR in a set of stacked PRs making up the mono-PR for Arrow 7+ support (#29161), and it is stacked on top of a PR fixing task cancellation in Ray Core (#29984) and a PR adding support for Arrow 7 (#29993). The last two commits are the relevant commits for review.

Summary of Changes

This PR:

- For Arrow 9+, adds allow_bucket_creation=true to S3 URIs for the Ray Core Storage API and for the Datasets S3 write API (#29815, "[Datasets] In Arrow 9+, creating S3 buckets requires explicit opt-in.").
- For Arrow 9+, creates an ExtensionScalar subclass for tensor extension types that returns an ndarray view from .as_py() (#29816, "[Datasets] For Arrow 8+, tensor column element access returns an ExtensionScalar.").
- For Arrow 8.*, manually converts the ExtensionScalar to an ndarray for tensor extension types, since the ExtensionScalar type exists but isn't subclassable in Arrow 8 (#29816).
- For Arrow 10+, matches on additional error messages that can be raised for permission issues when interacting with S3 (#29994, "[Datasets] In Arrow 10+, S3 errors raised due to permission issues can vary beyond our current pattern matching").
- Adds CI jobs for Arrow 8, 9, 10, and nightly.
- Removes the pyarrow version upper bound.
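
As a rough illustration of the first item in the list above, appending the opt-in parameter to an S3 URI could look like the sketch below. The helper name is hypothetical and this is not the code from the PR; it uses only the standard library and leaves out the Arrow version check that would restrict the change to Arrow 9+.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_bucket_creation_opt_in(uri: str) -> str:
    """Append allow_bucket_creation=true to S3 URIs (hypothetical helper)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3":
        # Local paths and other filesystems are returned unchanged.
        return uri
    # Preserve existing query parameters and do not override an explicit value.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.setdefault("allow_bucket_creation", "true")
    return urlunsplit(parts._replace(query=urlencode(params)))
```

A caller would apply this to the destination URI before handing it to the filesystem layer. Using `setdefault` keeps a value the user already supplied, which is a design choice of this sketch rather than something the PR description states.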
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue Dec 19, 2022
…nd nightly. (ray-project#29999)


Signed-off-by: Weichen Xu <[email protected]>