
[Datasets] [Arrow 7.0.0+ Support] [Mono-PR] Add support for Arrow 7+. #29161

Conversation

@clarkzinzow clarkzinzow commented Oct 7, 2022

This reverts commit 081ce2f, and unreverts PR #29055.

This PR adds support for Arrow 9 while preserving support for Arrow 6, 7, and 8; we duplicate the Datasets CI job for each of these Arrow major versions to ensure this support.

Summary of major changes

  • We change the Arrow version bounds to pyarrow >= 6.0.1 with no upper bound, except for Python 3.6 (Arrow dropped Python 3.6 support in Arrow 7) and Windows (Arrow 7+ has some blocker bugs on Windows) (link).
  • We duplicate the Datasets CI job for each major Arrow version we support: 6, 7, 8, and latest stable version (currently 9) (link).
  • For Arrow 9+, we add allow_bucket_creation=true to S3 URIs for the Ray Core Storage API and for the Datasets S3 write API (link 1, link 2).
  • For Arrow 9+, we create an ExtensionScalar subclass for tensor extension types that returns an ndarray view from .as_py() (link).
  • For Arrow 8.*, we manually convert the ExtensionScalar to an ndarray for tensor extension types, since the ExtensionScalar type exists but isn't subclassable in Arrow 8 (link 1, link 2).
  • For Arrow 10+, we match on other potential error messages when encountering permission issues when interacting with S3.
  • We override table concatenation and tensor extension array concatenation to ensure that concatenating tensor column chunks that each have homogeneous-shape but cumulatively have different shapes will result in a variable-shaped tensor column (link 1, link 2, link 3, link 4). Pulled out into separate PR.
  • Custom serialization hook for Arrow data types to work around Arrow serialization bug (pickling zero-copy views on Arrow data serializes the entire data buffer instead of truncating it) (link).
  • Adds a batch mutation API to ds.map_batches(), which allows the user to opt in to zero-copy batching via allow_mutate_batch=False (link).
  • Fixes unhandled task cancellation when deserializing task arguments and storing task errors.
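Several of the items above are gated on the installed Arrow major version. A minimal sketch of that gating pattern (the helper names here are hypothetical illustrations, not Ray's actual internals):

```python
import importlib


def arrow_major_version() -> int:
    """Return the installed pyarrow major version, or -1 if unavailable."""
    try:
        pyarrow = importlib.import_module("pyarrow")
    except ImportError:
        return -1
    return int(pyarrow.__version__.split(".")[0])


def supports_extension_scalar_subclassing() -> bool:
    # ExtensionScalar exists in Arrow 8 but only became subclassable in Arrow 9,
    # which is why Arrow 8 needs the manual ExtensionScalar-to-ndarray conversion.
    return arrow_major_version() >= 9
```

Centralizing such checks in one place is what the follow-up arrow_compat consolidation TODO below is aiming at.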

TODOs

  • Ensure that relevant AIR-level tests (e.g. extension type tests) run under the Datasets CI jobs for all supported major Arrow versions.
  • Ensure that Core tests that are sensitive to worker startup times are passing.
  • Ensure that pg_long_running_performance_test release test passes.
  • Decompose PR into stacked PRs.
  • (Post-merge) Open tracking issues for deprecating Arrow 6, 7, and 8.
  • (Follow-up PR) Consolidate Arrow-version-specific logic into a ray._private.arrow_compat module shared between the Storage API, Ray AIR, and Ray Datasets.

Related Issue Number

Closes #29814, closes #29816, closes #29815, closes #29817, closes #29822

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@clarkzinzow clarkzinzow force-pushed the datasets/unrevert/arrow-serialization-ipc branch 4 times, most recently from 671896b to bc55881 Compare October 7, 2022 23:18
@clarkzinzow clarkzinzow force-pushed the datasets/unrevert/arrow-serialization-ipc branch 2 times, most recently from 40c55ff to 43e8d7b Compare October 13, 2022 17:12
@clarkzinzow clarkzinzow force-pushed the datasets/unrevert/arrow-serialization-ipc branch from 3b41046 to 49ddd31 Compare October 13, 2022 22:17
@clarkzinzow clarkzinzow force-pushed the datasets/unrevert/arrow-serialization-ipc branch from 49ddd31 to 5f612b4 Compare October 13, 2022 22:47
scv119 commented Oct 14, 2022

Also, let's run the release tests; this area has been causing regressions.

@clarkzinzow clarkzinzow force-pushed the datasets/unrevert/arrow-serialization-ipc branch 5 times, most recently from ee0f5cd to 9356e55 Compare October 18, 2022 20:06
@jianoaix jianoaix left a comment


Thanks for taking care of these details of Arrow versioning.

On the other hand, it gets quite complicated. I think we need an exit strategy.
How about we file a few well-documented bugs for handling the drop of each version? E.g., in the bug for dropping 8.0, we document the conditions for dropping it, the cleanups we need to take care of, etc. This should make life a bit easier in the future.

@@ -472,7 +472,7 @@
# Dask tests and examples.
- bazel test --config=ci $(./ci/run/bazel_export_options) --build_tests_only --test_tag_filters=-client python/ray/util/dask/...

- label: "Dataset tests"
- label: "Dataset tests (Arrow - Latest Release)"

Is this duplication really warranted? We have other dependencies that also span a range of versions and need correctness guarantees, but they aren't multiplexed like this.

@clarkzinzow clarkzinzow Oct 19, 2022

Since Arrow is extremely fundamental to Datasets and since there are a lot of breaking changes between Arrow versions, I think that it's warranted. We need to support Arrow 6 until we drop Python 3.6, and as you know, we have users that are stuck on Python 3.6 for the foreseeable future that we want to ensure are successful. Supporting Arrow 7 and 8 is minimal effort on top of Arrow 9, so I'm proposing that we start with full 6-9 coverage and then work on a deprecation plan that we can communicate to users. I think that the eventual state we'll want to get to is that we support the latest stable release that's supported for each Python version.

return uri
parsed_uri = urlparse(uri)
if parsed_uri.scheme == "s3":
uri = _add_url_query_params(uri, {"allow_bucket_creation": True})

Should this override an existing allow_bucket_creation=False? If not, make these semantics clear in the docstring.


By default, this won't override if the query parameter already exists; the optional override argument controls this and is documented in the _add_url_query_params docstring.
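For illustration, a stdlib-only sketch of a helper with the semantics described here; this is a simplified stand-in for Ray's actual _add_url_query_params, with the override keyword mirroring the behavior discussed (existing parameters win unless override is set):

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def add_url_query_params(uri: str, params: dict, override: bool = False) -> str:
    """Add query parameters to a URI.

    By default, parameters already present in the URI take precedence;
    pass override=True to replace them with the new values.
    """
    parsed = urlparse(uri)
    existing = {k: v[-1] for k, v in parse_qs(parsed.query).items()}
    if override:
        merged = {**existing, **params}  # new params win
    else:
        merged = {**params, **existing}  # existing params win
    return urlunparse(parsed._replace(query=urlencode(merged)))
```

E.g., add_url_query_params("s3://bucket/path", {"allow_bucket_creation": "true"}) appends the parameter, while a URI that already carries allow_bucket_creation=False is left untouched unless override=True.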

@clarkzinzow clarkzinzow force-pushed the datasets/unrevert/arrow-serialization-ipc branch 3 times, most recently from 63f0631 to 9656624 Compare October 19, 2022 18:09
@clarkzinzow

How about we file a few well-documented bugs for handling the drop of each version? E.g. in the bug of dropping 8.0, we document what are the conditions to drop it, and what cleanups we need to take care of etc. This should make life a bit easier in the future.

@jianoaix Yep, I plan to create those tracking issues once this is merged (i.e., once we've added support for Arrow 7+ in master)! I'll add that as a TODO for this PR and create draft issues.

@clarkzinzow clarkzinzow force-pushed the datasets/unrevert/arrow-serialization-ipc branch 3 times, most recently from 107f139 to 278c59c Compare November 2, 2022 23:14
@clarkzinzow clarkzinzow force-pushed the datasets/unrevert/arrow-serialization-ipc branch from 278c59c to 381e4cb Compare November 3, 2022 13:18
@clarkzinzow clarkzinzow changed the title [Datasets] Unrevert "Arrow 7.0.0+ Support: Use Arrow IPC format for pickling Arrow data to circumvent slice view buffer truncation bug. (#29055)" [Datasets] [Arrow 7.0.0+ Support] [Mono-PR] Add support for Arrow 7+. Nov 3, 2022
@clarkzinzow

@scv119 FYI, I've confirmed that pg_long_running_performance_test, which was failing for the original PR in master, passes: https://buildkite.com/ray-project/release-tests-pr/builds/20110#01843e16-df16-47a9-a2d2-4fd32ea48750

clarkzinzow added a commit that referenced this pull request Nov 4, 2022
…ent deserialization and task error storage. (#29984)

Task execution currently doesn't handle KeyboardInterrupt (which is raised via the SIGINT signal upon task cancellation) during argument deserialization or task error storage, since we are only catching KeyboardInterrupt during actual execution of the task function. This can result in task cancellation causing the underlying worker to crash, since the task cancellation error may never be stored, causing a RAY_CHECK in the task transport to fail. Arguments that take a particularly long time to deserialize can cause this to be hit pretty consistently.

This PR adds a top-level try-except on KeyboardInterrupt that covers (nearly) the entire execute_task function body, ensuring that the SIGINT is properly handled.

This PR is the first PR in a set of stacked PRs making up this mono-PR for adding support for Arrow 7+ support in Ray: #29161
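The shape of this fix can be pictured with a toy sketch: the KeyboardInterrupt handler now wraps argument deserialization (and error storage) as well as the task function call itself. The names below are illustrative stand-ins, not Ray's internals:

```python
def execute_task(deserialize_args, run, store_error):
    """Toy model of a task executor that handles cancellation everywhere.

    Previously, only run(*args) was guarded; a SIGINT (KeyboardInterrupt)
    arriving during deserialize_args() would escape and crash the worker
    before the cancellation error could be stored.
    """
    try:
        args = deserialize_args()  # cancellation here is now handled too
        return run(*args)
    except KeyboardInterrupt:
        store_error("task was cancelled")
        return None
```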
clarkzinzow added a commit that referenced this pull request Nov 8, 2022
… Arrow serialization bug. (#29993)

This PR adds support for Arrow 7 in Ray, and is the second PR in a set of stacked PRs making up this mono-PR for Arrow 7+ support: #29161, and is stacked on top of a PR fixing task cancellation in Ray Core: #29984.

This PR:
- fixes a serialization bug in Arrow with a custom serializer for Arrow data ([Datasets] Arrow data buffers aren't truncated when pickling zero-copy slice views, leading to huge serialization bloat #29814)
- removes a bunch of defensive copying of Arrow data, which was a workaround for the aforementioned Arrow serialization bug
- adds a CI job for Arrow 7
- bumps the pyarrow upper bound to 8.0.0
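The custom serializer works by replacing pickle's default handling of Arrow objects. The general mechanism is a custom reducer; the toy version below truncates a "view" to just its visible slice before pickling, analogous to serializing a zero-copy Arrow slice via the IPC format instead of dragging along its full underlying buffer. The class and function names are illustrative, not Ray's actual implementation:

```python
import copyreg
import pickle


class BufferView:
    """Toy zero-copy view: holds a large backing buffer plus slice bounds."""

    def __init__(self, buf: bytes, start: int, stop: int):
        self.buf, self.start, self.stop = buf, start, stop


def _reduce_buffer_view(view: BufferView):
    # Serialize only the visible bytes, not the whole backing buffer;
    # this is the truncation that the Arrow IPC-based serializer achieves.
    visible = view.buf[view.start:view.stop]
    return BufferView, (visible, 0, len(visible))


# Register the reducer so pickle uses it for BufferView instances.
copyreg.pickle(BufferView, _reduce_buffer_view)

# A 10-byte view on a 1 MB buffer now pickles to a small payload.
view = BufferView(b"x" * 1_000_000, 0, 10)
payload = pickle.dumps(view)
```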
@h-vetinari

Just a note: This is becoming a topic of high importance in conda-forge. Arrow 10 was recently released, and we cannot reasonably support more than (an already exceptionally high) 4 arrow versions at the same time, for all the required migrations that are happening constantly. That means conda-forge will very soon be arrow 7+ only, and therefore ray will become progressively harder to install with other up-to-date packages, unless & until something like this PR gets merged & released.


clarkzinzow commented Nov 8, 2022

Thanks for the conda-forge context @h-vetinari! The good news is that this mono-PR has been decomposed into a stack of PRs, and the PR adding support for Arrow 7 has just been merged: #29993

The next PR in the stack adds support for Arrow 8 through Arrow nightly: #29999

So when conda-forge drops Arrow 6 support, Ray should hopefully be in a position to support any of the more recent major Arrow versions.

@clarkzinzow clarkzinzow closed this Nov 8, 2022
clarkzinzow added a commit that referenced this pull request Nov 9, 2022
…nd nightly. (#29999)

This PR adds support for Arrow 8, 9, 10, and nightly in Ray, and is the third PR in a set of stacked PRs making up this mono-PR for Arrow 7+ support (#29161), and is stacked on top of a PR fixing task cancellation in Ray Core (#29984) and a PR adding support for Arrow 7 (#29993). The last two commits are the relevant commits for review.

Summary of Changes

This PR:

- For Arrow 9+, add allow_bucket_creation=true to S3 URIs for the Ray Core Storage API and for the Datasets S3 write API ([Datasets] In Arrow 9+, creating S3 buckets requires explicit opt-in. #29815).
- For Arrow 9+, create an ExtensionScalar subclass for tensor extension types that returns an ndarray view from .as_py() ([Datasets] For Arrow 8+, tensor column element access returns an ExtensionScalar. #29816).
- For Arrow 8.*, we manually convert the ExtensionScalar to an ndarray for tensor extension types, since the ExtensionScalar type exists but isn't subclassable in Arrow 8 ([Datasets] For Arrow 8+, tensor column element access returns an ExtensionScalar. #29816).
- For Arrow 10+, we match on other potential error messages when encountering permission issues when interacting with S3 ([Datasets] In Arrow 10+, S3 errors raised due to permission issues can vary beyond our current pattern matching #29994).
- adds CI jobs for Arrow 8, 9, 10, and nightly
- removes the pyarrow version upper bound
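The widened S3 error matching for Arrow 10+ amounts to checking an error message against a list of patterns rather than a single one. A minimal sketch; the patterns below are illustrative placeholders, not the actual strings Ray matches:

```python
import re

# Hypothetical patterns; Arrow 10+ changed the wording of some S3
# permission errors, so multiple variants must be recognized.
_S3_PERMISSION_ERROR_PATTERNS = [
    r"AWS Error \[code \d+\]",
    r"ACCESS_DENIED",
]


def is_s3_permission_error(message: str) -> bool:
    """Return True if the message matches any known permission-error pattern."""
    return any(re.search(p, message) for p in _S3_PERMISSION_ERROR_PATTERNS)
```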
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022 (cherry-pick of ray-project#29984).
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022 (cherry-pick of ray-project#29993).
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022 (cherry-pick of ray-project#29999).