[Datasets] [Arrow 7+ Support - 3/N] Add support for Arrow 8, 9, 10, and nightly. #29999
Conversation
Force-pushed from f822d68 to 416f145.
Thanks for splitting this out. It's more reviewable.
```diff
@@ -368,6 +369,9 @@ def _init_filesystem(create_valid_file: bool = False, check_valid_file: bool = T
     fs_creator = _load_class(parsed_uri.netloc)
     _filesystem, _storage_prefix = fs_creator(parsed_uri.path)
 else:
+    # Arrow's S3FileSystem doesn't allow creating buckets by default, so we add a
+    # query arg enabling bucket creation if an S3 URI is provided.
+    _storage_uri = _add_creatable_buckets_param_if_s3_uri(_storage_uri)
```
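For illustration, a helper like `_add_creatable_buckets_param_if_s3_uri` could be sketched as follows. This is a minimal stdlib-only sketch, not Ray's actual implementation; only the query-parameter name `allow_bucket_creation` (Arrow's S3FileSystem URI option) is taken from the discussion above.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def add_creatable_buckets_param_if_s3_uri(uri: str) -> str:
    """If `uri` is an s3:// URI, append allow_bucket_creation=true as a query arg.

    Non-S3 URIs (e.g. local paths) are returned unchanged.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        return uri
    query = dict(parse_qsl(parsed.query))
    # Don't clobber an explicit user-provided setting.
    query.setdefault("allow_bucket_creation", "true")
    return urlunparse(parsed._replace(query=urlencode(query)))
```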
Not sure about the exact semantics of `ray.init(storage=uri)`, but is it desired to always create the bucket?
Those were the old semantics, and we have tests relying on that behavior, so this is just preserving the existing semantics of creating a bucket if necessary.
```python
return (
    PYARROW_VERSION is None
    or PYARROW_VERSION >= MIN_PYARROW_VERSION_SCALAR_SUBCLASS
)
```
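The version gate above could be backed by a parser like the following sketch. The helper names mirror the snippet, but the parsing logic is an illustrative stdlib-only stand-in; the `9.0.0` threshold comes from the PR description (`ExtensionScalar` becomes subclassable in Arrow 9+).

```python
from typing import Optional, Tuple


def parse_version(v: str) -> Tuple[int, ...]:
    # Parse "9.0.0" -> (9, 0, 0), stopping at the first non-numeric
    # component (e.g. "dev" / "rc" suffixes in nightly builds).
    parts = []
    for piece in v.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


MIN_PYARROW_VERSION_SCALAR_SUBCLASS = parse_version("9.0.0")


def scalar_subclass_supported(pyarrow_version: Optional[str]) -> bool:
    # None models a dev/nightly build whose version couldn't be determined;
    # treat it as new enough, matching the `PYARROW_VERSION is None` branch.
    return (
        pyarrow_version is None
        or parse_version(pyarrow_version) >= MIN_PYARROW_VERSION_SCALAR_SUBCLASS
    )
```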
@jianoaix Added the extension scalar support utilities.
```python
# methods isn't allowed.
if isinstance(key, slice):
    return super().__getitem__(key)
return self._to_numpy(key)
```
@jianoaix Consolidated the `__getitem__` override into a mixin that can be shared between `ArrowTensorArray` and `ArrowVariableShapedTensorArray`. This should also be very easy to remove once we support Arrow 9.0.0+ (just delete the mixin).
Yeah, this is a nice way to contain it and make it easily removable!
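The mixin pattern described above could be sketched like this. All names here are hypothetical stand-ins (not Ray's actual classes), and `_to_numpy` returns a tuple as a stand-in for ndarray conversion so the sketch stays dependency-free.

```python
class GetitemToNumpyMixin:
    """Shared __getitem__ override: slices fall through to the parent
    class, while scalar indices go through _to_numpy()."""

    def __getitem__(self, key):
        if isinstance(key, slice):
            return super().__getitem__(key)
        return self._to_numpy(key)


class FakeTensorArray(GetitemToNumpyMixin, list):
    # Stand-in for an Arrow extension array; `list` provides slice indexing.
    def _to_numpy(self, i):
        # Bypass the mixin to avoid recursion; a real array would build
        # an ndarray view here instead of a tuple.
        return tuple(list.__getitem__(self, i))
```

Because the override lives in one mixin, dropping support for the workaround later means deleting a single class rather than editing both array types.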
""" | ||
ExtensionScalar subclass with custom logic for this array of tensors type. | ||
""" | ||
return ArrowTensorScalar |
Now that we have `ArrowTensorScalar.as_py()` delegating to `self.type._extension_scalar_to_ndarray()`, we can use the same extension scalar type for both `ArrowTensorArray` and `ArrowVariableShapedTensorArray`.
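The delegation pattern can be sketched without pyarrow as follows: one scalar class serves both tensor types because conversion logic lives on the type, not the scalar. Class and method names below are illustrative stand-ins, not pyarrow's API.

```python
class TensorScalar:
    """One scalar class for all tensor extension types."""

    def __init__(self, value, type_):
        self.value = value
        self.type = type_

    def as_py(self):
        # Mirror ArrowTensorScalar.as_py() delegating to
        # self.type._extension_scalar_to_ndarray(): the type decides
        # how the raw storage turns into a Python-level value.
        return self.type._extension_scalar_to_ndarray(self)


class FixedShapeTensorType:
    def _extension_scalar_to_ndarray(self, scalar):
        # Stand-in for reshaping flat storage into a fixed-shape ndarray.
        return list(scalar.value)


class VariableShapedTensorType:
    def _extension_scalar_to_ndarray(self, scalar):
        # Stand-in for building a ragged/variable-shaped ndarray.
        return [list(row) for row in scalar.value]
```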
LGTM overall.
Force-pushed ("…o mixin, use ArrowTensorScalar for both tensor extension arrays.") from 7726596 to a059af7.
LGTM!
…nd nightly. (ray-project#29999)

This PR adds support for Arrow 8, 9, 10, and nightly in Ray, and is the third PR in a set of stacked PRs making up this mono-PR for Arrow 7+ support (ray-project#29161). It is stacked on top of a PR fixing task cancellation in Ray Core (ray-project#29984) and a PR adding support for Arrow 7 (ray-project#29993). The last two commits are the relevant commits for review.

Summary of Changes

This PR:

- For Arrow 9+, adds `allow_bucket_creation=true` to S3 URIs for the Ray Core Storage API and for the Datasets S3 write API ([Datasets] In Arrow 9+, creating S3 buckets requires explicit opt-in. ray-project#29815).
- For Arrow 9+, creates an `ExtensionScalar` subclass for tensor extension types that returns an ndarray view from `.as_py()` ([Datasets] For Arrow 8+, tensor column element access returns an `ExtensionScalar`. ray-project#29816).
- For Arrow 8.*, manually converts the `ExtensionScalar` to an ndarray for tensor extension types, since the `ExtensionScalar` type exists but isn't subclassable in Arrow 8 (ray-project#29816).
- For Arrow 10+, matches on other potential error messages when encountering permission issues while interacting with S3 ([Datasets] In Arrow 10+, S3 errors raised due to permission issues can vary beyond our current pattern matching ray-project#29994).
- Adds CI jobs for Arrow 8, 9, 10, and nightly.
- Removes the pyarrow version upper bound.

Signed-off-by: Weichen Xu <[email protected]>
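The Arrow 10+ error-matching change mentioned above amounts to testing an exception message against several patterns instead of one. A minimal sketch, where the specific patterns are hypothetical examples (the real set lives in Ray's S3 error handling):

```python
import re

# Hypothetical patterns: Arrow 10+ can surface S3 permission failures with
# varying messages, so a single pattern is no longer sufficient.
S3_PERMISSION_ERROR_PATTERNS = [
    re.compile(r"Access Denied", re.IGNORECASE),
    re.compile(r"HTTP status 403", re.IGNORECASE),
    re.compile(r"AWS Error \[code \d+\]", re.IGNORECASE),
]


def is_s3_permission_error(message: str) -> bool:
    """Return True if any known permission-failure pattern matches."""
    return any(p.search(message) for p in S3_PERMISSION_ERROR_PATTERNS)
```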
@h-vetinari Ah, that does indeed look like an oversight, thank you for highlighting it! I think this was missed in our CI testing of Arrow 8+ since we do a manual Arrow version override after installing Ray; if you're installing Ray via I'll open a hotfix PR that removes that upper bound.
This PR adds support for Arrow 8, 9, 10, and nightly in Ray. It is the third PR in a set of stacked PRs making up this mono-PR for Arrow 7+ support (#29161), and is stacked on top of a PR fixing task cancellation in Ray Core (#29984) and a PR adding support for Arrow 7 (#29993). The last two commits are the relevant commits for review.

Summary of Changes

This PR:

- For Arrow 9+, adds `allow_bucket_creation=true` to S3 URIs for the Ray Core Storage API and for the Datasets S3 write API ([Datasets] In Arrow 9+, creating S3 buckets requires explicit opt-in. #29815).
- For Arrow 9+, creates an `ExtensionScalar` subclass for tensor extension types that returns an ndarray view from `.as_py()` ([Datasets] For Arrow 8+, tensor column element access returns an `ExtensionScalar`. #29816).
- For Arrow 8.*, manually converts the `ExtensionScalar` to an ndarray for tensor extension types, since the `ExtensionScalar` type exists but isn't subclassable in Arrow 8 (#29816).

Related issue number

Closes #29816, closes #29815, closes #29994, closes #29995, closes #29996, closes #29997, closes #29998
Checks

- I've signed off every commit (using the `-s` flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.