[Datasets] [Arrow 7+ Support - 3/N] Add support for Arrow 8, 9, 10, and nightly. #29999
Conversation
Force-pushed from f822d68 to 416f145.
Thanks for splitting this out. It's more reviewable.
```diff
@@ -368,6 +369,9 @@ def _init_filesystem(create_valid_file: bool = False, check_valid_file: bool = T
     fs_creator = _load_class(parsed_uri.netloc)
     _filesystem, _storage_prefix = fs_creator(parsed_uri.path)
 else:
+    # Arrow's S3FileSystem doesn't allow creating buckets by default, so we add a
+    # query arg enabling bucket creation if an S3 URI is provided.
+    _storage_uri = _add_creatable_buckets_param_if_s3_uri(_storage_uri)
```
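For illustration, a helper like `_add_creatable_buckets_param_if_s3_uri` could be sketched as follows. This is a minimal stdlib-only sketch, not Ray's actual implementation; only the query-parameter name `allow_bucket_creation` (Arrow's S3FileSystem URI option) is taken from the discussion above.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def add_creatable_buckets_param_if_s3_uri(uri: str) -> str:
    """If `uri` is an s3:// URI, append allow_bucket_creation=true as a query arg.

    Non-S3 URIs (e.g. local paths) are returned unchanged.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        return uri
    query = dict(parse_qsl(parsed.query))
    # Don't clobber an explicit user-provided setting.
    query.setdefault("allow_bucket_creation", "true")
    return urlunparse(parsed._replace(query=urlencode(query)))
```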
Not sure about the exact semantics of `ray.init(storage=uri)`, but is it desired to always create the bucket?
Those were the old semantics, and we have tests relying on that behavior, so this is just preserving the existing semantics of creating a bucket if necessary.
```python
return (
    PYARROW_VERSION is None
    or PYARROW_VERSION >= MIN_PYARROW_VERSION_SCALAR_SUBCLASS
)
```
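The version gate above could be backed by a parser like the following sketch. The helper names mirror the snippet, but the parsing logic is an illustrative stdlib-only stand-in; the `9.0.0` threshold comes from the PR description (`ExtensionScalar` becomes subclassable in Arrow 9+).

```python
from typing import Optional, Tuple


def parse_version(v: str) -> Tuple[int, ...]:
    # Parse "9.0.0" -> (9, 0, 0), stopping at the first non-numeric
    # component (e.g. "dev" / "rc" suffixes in nightly builds).
    parts = []
    for piece in v.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


MIN_PYARROW_VERSION_SCALAR_SUBCLASS = parse_version("9.0.0")


def scalar_subclass_supported(pyarrow_version: Optional[str]) -> bool:
    # None models a dev/nightly build whose version couldn't be determined;
    # treat it as new enough, matching the `PYARROW_VERSION is None` branch.
    return (
        pyarrow_version is None
        or parse_version(pyarrow_version) >= MIN_PYARROW_VERSION_SCALAR_SUBCLASS
    )
```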
@jianoaix Added the extension scalar support utilities.
```python
# methods isn't allowed.
if isinstance(key, slice):
    return super().__getitem__(key)
return self._to_numpy(key)
```
@jianoaix Consolidated the `__getitem__` override into a mixin that can be shared between `ArrowTensorArray` and `ArrowVariableShapedTensorArray`. This should also be very easy to remove once we support Arrow 9.0.0+ (just delete the mixin).
Yeah, this is a nice way to contain it and make it easily removable!
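The mixin pattern described above could be sketched like this. All names here are hypothetical stand-ins (not Ray's actual classes), and `_to_numpy` returns a tuple as a stand-in for ndarray conversion so the sketch stays dependency-free.

```python
class GetitemToNumpyMixin:
    """Shared __getitem__ override: slices fall through to the parent
    class, while scalar indices go through _to_numpy()."""

    def __getitem__(self, key):
        if isinstance(key, slice):
            return super().__getitem__(key)
        return self._to_numpy(key)


class FakeTensorArray(GetitemToNumpyMixin, list):
    # Stand-in for an Arrow extension array; `list` provides slice indexing.
    def _to_numpy(self, i):
        # Bypass the mixin to avoid recursion; a real array would build
        # an ndarray view here instead of a tuple.
        return tuple(list.__getitem__(self, i))
```

Because the override lives in one mixin, dropping support for the workaround later means deleting a single class rather than editing both array types.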
""" | ||
ExtensionScalar subclass with custom logic for this array of tensors type. | ||
""" | ||
return ArrowTensorScalar |
Now that we have `ArrowTensorScalar.as_py()` delegating to `self.type._extension_scalar_to_ndarray()`, we can use the same extension scalar type for both `ArrowTensorArray` and `ArrowVariableShapedTensorArray`.
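The delegation pattern can be sketched without pyarrow as follows: one scalar class serves both tensor types because conversion logic lives on the type, not the scalar. Class and method names below are illustrative stand-ins, not pyarrow's API.

```python
class TensorScalar:
    """One scalar class for all tensor extension types."""

    def __init__(self, value, type_):
        self.value = value
        self.type = type_

    def as_py(self):
        # Mirror ArrowTensorScalar.as_py() delegating to
        # self.type._extension_scalar_to_ndarray(): the type decides
        # how the raw storage turns into a Python-level value.
        return self.type._extension_scalar_to_ndarray(self)


class FixedShapeTensorType:
    def _extension_scalar_to_ndarray(self, scalar):
        # Stand-in for reshaping flat storage into a fixed-shape ndarray.
        return list(scalar.value)


class VariableShapedTensorType:
    def _extension_scalar_to_ndarray(self, scalar):
        # Stand-in for building a ragged/variable-shaped ndarray.
        return [list(row) for row in scalar.value]
```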
LGTM overall.
Force-pushed ("…o mixin, use ArrowTensorScalar for both tensor extension arrays.") from 7726596 to a059af7.
LGTM!
…nd nightly. (ray-project#29999)

This PR adds support for Arrow 8, 9, 10, and nightly in Ray, and is the third PR in a set of stacked PRs making up this mono-PR for Arrow 7+ support (ray-project#29161). It is stacked on top of a PR fixing task cancellation in Ray Core (ray-project#29984) and a PR adding support for Arrow 7 (ray-project#29993). The last two commits are the relevant commits for review.

Summary of Changes

This PR:

- For Arrow 9+, adds `allow_bucket_creation=true` to S3 URIs for the Ray Core Storage API and for the Datasets S3 write API ([Datasets] In Arrow 9+, creating S3 buckets requires explicit opt-in. ray-project#29815).
- For Arrow 9+, creates an `ExtensionScalar` subclass for tensor extension types that returns an ndarray view from `.as_py()` ([Datasets] For Arrow 8+, tensor column element access returns an `ExtensionScalar`. ray-project#29816).
- For Arrow 8.*, manually converts the `ExtensionScalar` to an ndarray for tensor extension types, since the `ExtensionScalar` type exists but isn't subclassable in Arrow 8 (ray-project#29816).
- For Arrow 10+, matches on other potential error messages when encountering permission issues while interacting with S3 ([Datasets] In Arrow 10+, S3 errors raised due to permission issues can vary beyond our current pattern matching ray-project#29994).
- Adds CI jobs for Arrow 8, 9, 10, and nightly.
- Removes the pyarrow version upper bound.

Signed-off-by: Weichen Xu <[email protected]>
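The Arrow 10+ error-matching change mentioned above amounts to testing an exception message against several patterns instead of one. A minimal sketch, where the specific patterns are hypothetical examples (the real set lives in Ray's S3 error handling):

```python
import re

# Hypothetical patterns: Arrow 10+ can surface S3 permission failures with
# varying messages, so a single pattern is no longer sufficient.
S3_PERMISSION_ERROR_PATTERNS = [
    re.compile(r"Access Denied", re.IGNORECASE),
    re.compile(r"HTTP status 403", re.IGNORECASE),
    re.compile(r"AWS Error \[code \d+\]", re.IGNORECASE),
]


def is_s3_permission_error(message: str) -> bool:
    """Return True if any known permission-failure pattern matches."""
    return any(p.search(message) for p in S3_PERMISSION_ERROR_PATTERNS)
```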
@h-vetinari Ah, that does indeed look like an oversight, thank you for highlighting it! I think this was missed in our CI testing of Arrow 8+ since we do a manual Arrow version override after installing Ray; if you're installing Ray via I'll open a hotfix PR that removes that upper bound.
This PR adds support for Arrow 8, 9, 10, and nightly in Ray. It is the third PR in a set of stacked PRs making up this mono-PR for Arrow 7+ support (#29161), and is stacked on top of a PR fixing task cancellation in Ray Core (#29984) and a PR adding support for Arrow 7 (#29993). The last two commits are the relevant commits for review.

Summary of Changes

This PR:

- For Arrow 9+, adds `allow_bucket_creation=true` to S3 URIs for the Ray Core Storage API and for the Datasets S3 write API ([Datasets] In Arrow 9+, creating S3 buckets requires explicit opt-in. #29815).
- For Arrow 9+, creates an `ExtensionScalar` subclass for tensor extension types that returns an ndarray view from `.as_py()` ([Datasets] For Arrow 8+, tensor column element access returns an `ExtensionScalar`. #29816).
- For Arrow 8.*, manually converts the `ExtensionScalar` to an ndarray for tensor extension types, since the `ExtensionScalar` type exists but isn't subclassable in Arrow 8 (#29816).

Related issue number

Closes #29816, closes #29815, closes #29994, closes #29995, closes #29996, closes #29997, closes #29998
Checks

- I've signed off every commit (using the `-s` flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.