
[Datasets] [Arrow 7.0.0+ Support] [Mono-PR] Add support for Arrow 7+. #29161

Conversation

@clarkzinzow clarkzinzow commented Oct 7, 2022

This reverts commit 081ce2f, and unreverts PR #29055.

This PR adds support for Arrow 9 while preserving support for Arrow 6, 7, and 8; we duplicate the Datasets CI job for each of these Arrow major versions to ensure this support.

Summary of major changes

  • We change the Arrow version bounds to pyarrow >= 6.0.1 with no upper bound, except for Python 3.6 (Arrow dropped Python 3.6 support in Arrow 7) and Windows (Arrow 7+ has some blocker bugs on Windows) (link).
  • We duplicate the Datasets CI job for each major Arrow version we support: 6, 7, 8, and latest stable version (currently 9) (link).
  • For Arrow 9+, we add allow_bucket_creation=true to S3 URIs for the Ray Core Storage API and for the Datasets S3 write API (link 1, link 2).
  • For Arrow 9+, we create an ExtensionScalar subclass for tensor extension types that returns an ndarray view from .as_py() (link).
  • For Arrow 8.*, we manually convert the ExtensionScalar to an ndarray for tensor extension types, since the ExtensionScalar type exists but isn't subclassable in Arrow 8 (link 1, link 2).
  • For Arrow 10+, we match on other potential error messages when encountering permission issues when interacting with S3.
  • We override table concatenation and tensor extension array concatenation to ensure that concatenating tensor column chunks that each have homogeneous-shape but cumulatively have different shapes will result in a variable-shaped tensor column (link 1, link 2, link 3, link 4). Pulled out into separate PR.
  • Custom serialization hook for Arrow data types to work around Arrow serialization bug (pickling zero-copy views on Arrow data serializes the entire data buffer instead of truncating it) (link).
  • Adds a batch mutation API to ds.map_batches(), which allows the user to opt in to zero-copy batching via allow_mutate_batch=False (link).
  • Fixes unhandled task cancellation when deserializing task arguments and storing task errors.
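Several of the items above are gated on the installed Arrow major version. A minimal sketch of that gating pattern (the helper names here are hypothetical illustrations, not Ray's actual internals):

```python
import importlib


def arrow_major_version() -> int:
    """Return the installed pyarrow major version, or -1 if unavailable."""
    try:
        pyarrow = importlib.import_module("pyarrow")
    except ImportError:
        return -1
    return int(pyarrow.__version__.split(".")[0])


def supports_extension_scalar_subclassing() -> bool:
    # ExtensionScalar exists in Arrow 8 but only became subclassable in Arrow 9,
    # which is why Arrow 8 needs the manual ExtensionScalar-to-ndarray conversion.
    return arrow_major_version() >= 9
```

Centralizing such checks in one place is what the follow-up arrow_compat consolidation TODO below is aiming at.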

TODOs

  • Ensure that relevant AIR-level tests (e.g. extension type tests) run under the Datasets CI jobs for all supported major Arrow versions.
  • Ensure that Core tests that are sensitive to worker startup times are passing.
  • Ensure that pg_long_running_performance_test release test passes.
  • Decompose PR into stacked PRs.
  • (Post-merge) Open tracking issues for deprecating Arrow 6, 7, and 8.
  • (Follow-up PR) Consolidate Arrow-version-specific logic into a ray._private.arrow_compat module shared between the Storage API, Ray AIR, and Ray Datasets.

Related Issue Number

Closes #29814, closes #29816, closes #29815, closes #29817, closes #29822

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@clarkzinzow clarkzinzow force-pushed the datasets/unrevert/arrow-serialization-ipc branch 4 times, most recently from 671896b to bc55881 Compare October 7, 2022 23:18
@clarkzinzow clarkzinzow force-pushed the datasets/unrevert/arrow-serialization-ipc branch 2 times, most recently from 40c55ff to 43e8d7b Compare October 13, 2022 17:12
@clarkzinzow clarkzinzow force-pushed the datasets/unrevert/arrow-serialization-ipc branch from 3b41046 to 49ddd31 Compare October 13, 2022 22:17
@clarkzinzow clarkzinzow force-pushed the datasets/unrevert/arrow-serialization-ipc branch from 49ddd31 to 5f612b4 Compare October 13, 2022 22:47
scv119 commented Oct 14, 2022

Also, let's run the release tests; this area has been causing regressions.

@clarkzinzow clarkzinzow force-pushed the datasets/unrevert/arrow-serialization-ipc branch 5 times, most recently from ee0f5cd to 9356e55 Compare October 18, 2022 20:06
@jianoaix jianoaix left a comment


Thanks for taking care of these details of Arrow versioning.

On the other hand, it gets quite complicated. I think we need an exit strategy.
How about we file a few well-documented bugs for handling the drop of each version? E.g., in the bug for dropping 8.0, we document the conditions for dropping it, the cleanups we need to take care of, etc. This should make life a bit easier in the future.

@@ -472,7 +472,7 @@
# Dask tests and examples.
- bazel test --config=ci $(./ci/run/bazel_export_options) --build_tests_only --test_tag_filters=-client python/ray/util/dask/...

- label: "Dataset tests"
- label: "Dataset tests (Arrow - Latest Release)"

Is this duplication really warranted? We have other dependencies that also span a range of versions and need correctness guarantees, but they aren't multiplexed like this.

@clarkzinzow clarkzinzow Oct 19, 2022

Since Arrow is extremely fundamental to Datasets and since there are a lot of breaking changes between Arrow versions, I think that it's warranted. We need to support Arrow 6 until we drop Python 3.6, and as you know, we have users that are stuck on Python 3.6 for the foreseeable future that we want to ensure are successful. Supporting Arrow 7 and 8 is minimal effort on top of Arrow 9, so I'm proposing that we start with full 6-9 coverage and then work on a deprecation plan that we can communicate to users. I think that the eventual state we'll want to get to is that we support the latest stable release that's supported for each Python version.

return uri
parsed_uri = urlparse(uri)
if parsed_uri.scheme == "s3":
uri = _add_url_query_params(uri, {"allow_bucket_creation": True})

Should this override an existing allow_bucket_creation=False? If not, make these semantics clear in the docstring.


By default, this won't override if the query parameter already exists; the optional override argument controls this and is documented in the _add_url_query_params docstring.
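For illustration, a stdlib-only sketch of a helper with the semantics described here; this is a simplified stand-in for Ray's actual _add_url_query_params, with the override keyword mirroring the behavior discussed (existing parameters win unless override is set):

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def add_url_query_params(uri: str, params: dict, override: bool = False) -> str:
    """Add query parameters to a URI.

    By default, parameters already present in the URI take precedence;
    pass override=True to replace them with the new values.
    """
    parsed = urlparse(uri)
    existing = {k: v[-1] for k, v in parse_qs(parsed.query).items()}
    if override:
        merged = {**existing, **params}  # new params win
    else:
        merged = {**params, **existing}  # existing params win
    return urlunparse(parsed._replace(query=urlencode(merged)))
```

E.g., add_url_query_params("s3://bucket/path", {"allow_bucket_creation": "true"}) appends the parameter, while a URI that already carries allow_bucket_creation=False is left untouched unless override=True.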

@clarkzinzow clarkzinzow force-pushed the datasets/unrevert/arrow-serialization-ipc branch 3 times, most recently from 63f0631 to 9656624 Compare October 19, 2022 18:09
@clarkzinzow

How about we file a few well-documented bugs for handling the drop of each version? E.g. in the bug of dropping 8.0, we document what are the conditions to drop it, and what cleanups we need to take care of etc. This should make life a bit easier in the future.

@jianoaix Yep, I plan to create those tracking issues once this is merged (i.e., once we've added support for Arrow 7+ in master)! I'll add that as a TODO for this PR and create draft issues.

@clarkzinzow clarkzinzow force-pushed the datasets/unrevert/arrow-serialization-ipc branch 3 times, most recently from 107f139 to 278c59c Compare November 2, 2022 23:14
@clarkzinzow clarkzinzow force-pushed the datasets/unrevert/arrow-serialization-ipc branch from 278c59c to 381e4cb Compare November 3, 2022 13:18
@clarkzinzow clarkzinzow changed the title [Datasets] Unrevert "Arrow 7.0.0+ Support: Use Arrow IPC format for pickling Arrow data to circumvent slice view buffer truncation bug. (#29055)" [Datasets] [Arrow 7.0.0+ Support] [Mono-PR] Add support for Arrow 7+. Nov 3, 2022
@clarkzinzow

@scv119 FYI, I've confirmed that pg_long_running_performance_test, which was failing for the original PR in master, passes: https://buildkite.com/ray-project/release-tests-pr/builds/20110#01843e16-df16-47a9-a2d2-4fd32ea48750

clarkzinzow added a commit that referenced this pull request Nov 4, 2022
…ent deserialization and task error storage. (#29984)

Task execution currently doesn't handle KeyboardInterrupt (which is raised via the SIGINT signal upon task cancellation) during argument deserialization or task error storage, since we are only catching KeyboardInterrupt during actual execution of the task function. This can result in task cancellation causing the underlying worker to crash, since the task cancellation error may never be stored, causing a RAY_CHECK in the task transport to fail. Arguments that take a particularly long time to deserialize can cause this to be hit pretty consistently.

This PR adds a top-level try-except on KeyboardInterrupt that covers (nearly) the entire execute_task function body, ensuring that the SIGINT is properly handled.

This PR is the first PR in a set of stacked PRs making up this mono-PR for adding support for Arrow 7+ support in Ray: #29161
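The shape of this fix can be pictured with a toy sketch: the KeyboardInterrupt handler now wraps argument deserialization (and error storage) as well as the task function call itself. The names below are illustrative stand-ins, not Ray's internals:

```python
def execute_task(deserialize_args, run, store_error):
    """Toy model of a task executor that handles cancellation everywhere.

    Previously, only run(*args) was guarded; a SIGINT (KeyboardInterrupt)
    arriving during deserialize_args() would escape and crash the worker
    before the cancellation error could be stored.
    """
    try:
        args = deserialize_args()  # cancellation here is now handled too
        return run(*args)
    except KeyboardInterrupt:
        store_error("task was cancelled")
        return None
```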
clarkzinzow added a commit that referenced this pull request Nov 8, 2022
… Arrow serialization bug. (#29993)

This PR adds support for Arrow 7 in Ray, and is the second PR in a set of stacked PRs making up this mono-PR for Arrow 7+ support: #29161, and is stacked on top of a PR fixing task cancellation in Ray Core: #29984.

This PR:
- fixes a serialization bug in Arrow with a custom serializer for Arrow data ([Datasets] Arrow data buffers aren't truncated when pickling zero-copy slice views, leading to huge serialization bloat #29814)
- removes a bunch of defensive copying of Arrow data, which was a workaround for the aforementioned Arrow serialization bug
- adds a CI job for Arrow 7
- bumps the pyarrow upper bound to 8.0.0
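The custom serializer works by replacing pickle's default handling of Arrow objects. The general mechanism is a custom reducer; the toy version below truncates a "view" to just its visible slice before pickling, analogous to serializing a zero-copy Arrow slice via the IPC format instead of dragging along its full underlying buffer. The class and function names are illustrative, not Ray's actual implementation:

```python
import copyreg
import pickle


class BufferView:
    """Toy zero-copy view: holds a large backing buffer plus slice bounds."""

    def __init__(self, buf: bytes, start: int, stop: int):
        self.buf, self.start, self.stop = buf, start, stop


def _reduce_buffer_view(view: BufferView):
    # Serialize only the visible bytes, not the whole backing buffer;
    # this is the truncation that the Arrow IPC-based serializer achieves.
    visible = view.buf[view.start:view.stop]
    return BufferView, (visible, 0, len(visible))


# Register the reducer so pickle uses it for BufferView instances.
copyreg.pickle(BufferView, _reduce_buffer_view)

# A 10-byte view on a 1 MB buffer now pickles to a small payload.
view = BufferView(b"x" * 1_000_000, 0, 10)
payload = pickle.dumps(view)
```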
@h-vetinari

Just a note: This is becoming a topic of high importance in conda-forge. Arrow 10 was recently released, and we cannot reasonably support more than (an already exceptionally high) 4 arrow versions at the same time, for all the required migrations that are happening constantly. That means conda-forge will very soon be arrow 7+ only, and therefore ray will become progressively harder to install with other up-to-date packages, unless & until something like this PR gets merged & released.


clarkzinzow commented Nov 8, 2022

Thanks for the conda-forge context @h-vetinari! The good news is that this mono-PR has been decomposed into a stack of PRs, and the PR adding support for Arrow 7 has just been merged: #29993

The next PR in the stack adds support for Arrow 8 through Arrow nightly: #29999

So when conda-forge drops Arrow 6 support, Ray should hopefully be in a position to support any of the more recent major Arrow versions.

@clarkzinzow clarkzinzow closed this Nov 8, 2022
clarkzinzow added a commit that referenced this pull request Nov 9, 2022
…nd nightly. (#29999)

This PR adds support for Arrow 8, 9, 10, and nightly in Ray, and is the third PR in a set of stacked PRs making up this mono-PR for Arrow 7+ support (#29161), and is stacked on top of a PR fixing task cancellation in Ray Core (#29984) and a PR adding support for Arrow 7 (#29993). The last two commits are the relevant commits for review.

Summary of Changes

This PR:

- For Arrow 9+, add allow_bucket_creation=true to S3 URIs for the Ray Core Storage API and for the Datasets S3 write API ([Datasets] In Arrow 9+, creating S3 buckets requires explicit opt-in. #29815).
- For Arrow 9+, create an ExtensionScalar subclass for tensor extension types that returns an ndarray view from .as_py() ([Datasets] For Arrow 8+, tensor column element access returns an ExtensionScalar. #29816).
- For Arrow 8.*, we manually convert the ExtensionScalar to an ndarray for tensor extension types, since the ExtensionScalar type exists but isn't subclassable in Arrow 8 ([Datasets] For Arrow 8+, tensor column element access returns an ExtensionScalar. #29816).
- For Arrow 10+, we match on other potential error messages when encountering permission issues when interacting with S3 ([Datasets] In Arrow 10+, S3 errors raised due to permission issues can vary beyond our current pattern matching #29994).
- adds CI jobs for Arrow 8, 9, 10, and nightly
- removes the pyarrow version upper bound
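The widened S3 error matching for Arrow 10+ amounts to checking an error message against a list of patterns rather than a single one. A minimal sketch; the patterns below are illustrative placeholders, not the actual strings Ray matches:

```python
import re

# Hypothetical patterns; Arrow 10+ changed the wording of some S3
# permission errors, so multiple variants must be recognized.
_S3_PERMISSION_ERROR_PATTERNS = [
    r"AWS Error \[code \d+\]",
    r"ACCESS_DENIED",
]


def is_s3_permission_error(message: str) -> bool:
    """Return True if the message matches any known permission-error pattern."""
    return any(re.search(p, message) for p in _S3_PERMISSION_ERROR_PATTERNS)
```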
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022 (cherry-pick of ray-project#29984).
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022 (cherry-pick of ray-project#29993).
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022 (cherry-pick of ray-project#29999).