[Datasets] In Arrow 9+, creating S3 buckets requires explicit opt-in. #29815

Closed
clarkzinzow opened this issue Oct 28, 2022 · 0 comments · Fixed by #29999
Assignees: clarkzinzow
Labels: bug, data, P1

Comments

@clarkzinzow
Contributor

In Arrow 9+, creating S3 buckets requires either passing allow_bucket_creation=True when instantiating the S3FileSystem, or passing a URI with the allow_bucket_creation=true query parameter set. We need to add this automatically for S3 URIs whenever we have to create S3 buckets on the user's behalf.
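
A minimal sketch of the URI-based opt-in described above, assuming the query parameter is spelled `allow_bucket_creation` (the spelling used by the fix in #29999). The helper name `add_bucket_creation_opt_in` is hypothetical and this is not necessarily how Ray implements it; it only illustrates appending the parameter to S3 URIs while leaving other URIs and any explicit user setting untouched.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_bucket_creation_opt_in(uri: str) -> str:
    """Return `uri` with allow_bucket_creation=true appended if it is an S3 URI.

    Hypothetical helper for illustration only; not Ray's actual implementation.
    """
    parts = urlsplit(uri)
    if parts.scheme != "s3":
        # Local paths and other filesystems are returned unchanged.
        return uri
    params = parse_qsl(parts.query, keep_blank_values=True)
    if any(key == "allow_bucket_creation" for key, _ in params):
        # Respect a value the user already set explicitly.
        return uri
    params.append(("allow_bucket_creation", "true"))
    return urlunsplit(parts._replace(query=urlencode(params)))


print(add_bucket_creation_opt_in("s3://bucket/path"))
print(add_bucket_creation_opt_in("s3://bucket/path?region=us-west-2"))
```

The rewritten URI could then be handed to `pyarrow.fs.FileSystem.from_uri`, which parses the query string into filesystem options.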

clarkzinzow added the bug, P1, air, and data labels Oct 28, 2022
clarkzinzow added this to the Arrow 7+ Support milestone Oct 28, 2022
clarkzinzow self-assigned this Oct 28, 2022
clarkzinzow added a commit that referenced this issue Nov 9, 2022
…nd nightly. (#29999)

This PR adds support for Arrow 8, 9, 10, and nightly in Ray. It is the third PR in a set of stacked PRs making up the mono-PR for Arrow 7+ support (#29161), and it is stacked on top of a PR fixing task cancellation in Ray Core (#29984) and a PR adding support for Arrow 7 (#29993). The last two commits are the relevant commits for review.

Summary of Changes

This PR:

- For Arrow 9+, adds allow_bucket_creation=true to S3 URIs for the Ray Core Storage API and for the Datasets S3 write API ([Datasets] In Arrow 9+, creating S3 buckets requires explicit opt-in. #29815).
- For Arrow 9+, creates an ExtensionScalar subclass for tensor extension types that returns an ndarray view from .as_py() ([Datasets] For Arrow 8+, tensor column element access returns an ExtensionScalar. #29816).
- For Arrow 8.*, manually converts the ExtensionScalar to an ndarray for tensor extension types, since the ExtensionScalar type exists but isn't subclassable in Arrow 8 ([Datasets] For Arrow 8+, tensor column element access returns an ExtensionScalar. #29816).
- For Arrow 10+, matches on other potential error messages when encountering permission issues while interacting with S3 ([Datasets] In Arrow 10+, S3 errors raised due to permission issues can vary beyond our current pattern matching #29994).
- Adds CI jobs for Arrow 8, 9, 10, and nightly.
- Removes the pyarrow version upper bound.
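
The version gating in the list above can be summarized as a rough sketch. The `ArrowCompat` dataclass and `arrow_compat` function are hypothetical names introduced only for illustration and are not taken from the PR. In practice the version string would likely come from `pyarrow.__version__`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrowCompat:
    """Which workarounds from the summary apply to a given Arrow version."""

    s3_bucket_creation_opt_in: bool      # Arrow 9+
    tensor_scalar_subclass: bool         # Arrow 9+
    tensor_scalar_manual_convert: bool   # Arrow 8.* only
    broad_s3_permission_matching: bool   # Arrow 10+


def arrow_compat(version: str) -> ArrowCompat:
    # The major version is the leading integer, e.g. "10.0.0.dev123" -> 10.
    major = int(version.split(".", 1)[0])
    return ArrowCompat(
        s3_bucket_creation_opt_in=major >= 9,
        tensor_scalar_subclass=major >= 9,
        tensor_scalar_manual_convert=major == 8,
        broad_s3_permission_matching=major >= 10,
    )


print(arrow_compat("9.0.0"))
```

Keeping the thresholds in one place like this is only one possible design; scattered version checks at each call site would behave the same.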
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue Dec 19, 2022
…nd nightly. (ray-project#29999)


Signed-off-by: Weichen Xu <[email protected]>