[Datasets] [Arrow 7.0.0+ Support] [Mono-PR] Add support for Arrow 7+. #29161
Conversation
Also, let's run the release tests; this has been causing regressions.
Thanks for taking care of these Arrow versioning details.
On the other hand, it gets quite complicated. I think we need an exit strategy.
How about we file a few well-documented bugs for handling the drop of each version? E.g., in the bug for dropping 8.0, we'd document the conditions for dropping it, the cleanups we'd need to take care of, etc. This should make life a bit easier in the future.
.buildkite/pipeline.ml.yml (Outdated)

@@ -472,7 +472,7 @@
      # Dask tests and examples.
      - bazel test --config=ci $(./ci/run/bazel_export_options) --build_tests_only --test_tag_filters=-client python/ray/util/dask/...

-  - label: "Dataset tests"
+  - label: "Dataset tests (Arrow - Latest Release)"
Is this duplication really warranted? We have other dependencies that also span a range of versions and for which we have to ensure correctness as well, but they aren't multiplexed like this.
Since Arrow is extremely fundamental to Datasets and since there are a lot of breaking changes between Arrow versions, I think it's warranted. We need to support Arrow 6 until we drop Python 3.6, and as you know, we have users stuck on Python 3.6 for the foreseeable future whom we want to ensure are successful. Supporting Arrow 7 and 8 is minimal effort on top of Arrow 9, so I'm proposing that we start with full 6-9 coverage and then work on a deprecation plan that we can communicate to users. I think the eventual state we'll want to get to is supporting, for each Python version, the latest stable Arrow release that supports it.
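As an aside, multi-version support like this usually reduces to small runtime version gates. Here is a minimal sketch of the pattern; the helper name and the Arrow 9 threshold are illustrative, not Ray's actual compatibility code:

```python
# Illustrative sketch of runtime version gating; not Ray's actual
# compatibility layer.
from packaging.version import parse as parse_version

import pyarrow


def supports_subclassable_extension_scalars() -> bool:
    # ExtensionScalar exists in Arrow 8 but only became subclassable
    # (via __arrow_ext_scalar_class__) in Arrow 9.
    return parse_version(pyarrow.__version__) >= parse_version("9.0.0")
```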
return uri
parsed_uri = urlparse(uri)
if parsed_uri.scheme == "s3":
    uri = _add_url_query_params(uri, {"allow_bucket_creation": True})
Should this override an existing allow_bucket_creation=False? If not, make these semantics clear in the docstring.
By default, this won't override the query parameter if it already exists; the optional override argument controls this and is documented in the _add_url_query_params docstring.
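For concreteness, here is a minimal sketch of a helper with the semantics described above. The signature is assumed from this thread; the actual Ray helper may differ in detail:

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def _add_url_query_params(uri: str, params: dict, override: bool = False) -> str:
    """Add query parameters to a URI.

    By default, parameters already present in the URI win over the new
    values; pass override=True to replace them. (Sketch of the semantics
    discussed above, not Ray's exact implementation.)
    """
    parsed = urlparse(uri)
    query = parse_qs(parsed.query)
    for key, value in params.items():
        if override or key not in query:
            query[key] = [str(value)]
    return urlunparse(parsed._replace(query=urlencode(query, doseq=True)))
```

Under this sketch, calling `_add_url_query_params("s3://bucket?allow_bucket_creation=False", {"allow_bucket_creation": True})` leaves the existing `allow_bucket_creation=False` untouched.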
@jianoaix Yep, I plan to create those tracking issues once this is merged (i.e., once we've added support for Arrow 7+ in master)! I'll add that as a TODO for this PR and create draft issues.
… for pickling Arrow data to circumvent slice view buffer truncation bug. (ray-project#29055)" (ray-project#29138)"

This reverts commit 081ce2f.
@scv119 FYI, I've confirmed that …
…ent deserialization and task error storage. (#29984)

Task execution currently doesn't handle KeyboardInterrupt (which is raised via the SIGINT signal upon task cancellation) during argument deserialization or task error storage, since we are only catching KeyboardInterrupt during actual execution of the task function. This can result in task cancellation crashing the underlying worker: the task cancellation error may never be stored, causing a RAY_CHECK in the task transport to fail. Arguments that take a particularly long time to deserialize can cause this to be hit pretty consistently.

This PR adds a top-level try-except on KeyboardInterrupt that covers (nearly) the entire execute_task function body, ensuring that the SIGINT is properly handled.

This PR is the first in a set of stacked PRs making up this mono-PR for adding Arrow 7+ support in Ray: #29161
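To make the control-flow change concrete, here is a schematic sketch; every helper name below is a simplified placeholder, not Ray's actual worker code:

```python
# Schematic of the fix described above. All helpers are placeholders.

def deserialize_args(task):  # placeholder: may block for a long time
    return task["args"]

def store_result(task, result):  # placeholder
    task["result"] = result

def store_task_cancellation_error(task):  # placeholder
    task["error"] = "TASK_CANCELLED"

def execute_task(task):
    try:
        # Argument deserialization and result/error storage now sit inside
        # the try block, so a KeyboardInterrupt (SIGINT from cancellation)
        # raised during those phases is caught instead of crashing the
        # worker before a terminal task status is stored.
        args = deserialize_args(task)
        result = task["function"](*args)
        store_result(task, result)
    except KeyboardInterrupt:
        # Storing the cancellation error keeps the task transport's
        # RAY_CHECK (that some terminal status was recorded) satisfied.
        store_task_cancellation_error(task)
```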
… Arrow serialization bug. (#29993)

This PR adds support for Arrow 7 in Ray. It is the second PR in a set of stacked PRs making up this mono-PR for Arrow 7+ support (#29161), and is stacked on top of a PR fixing task cancellation in Ray Core (#29984).

This PR:
- fixes a serialization bug in Arrow with a custom serializer for Arrow data ([Datasets] Arrow data buffers aren't truncated when pickling zero-copy slice views, leading to huge serialization bloat; #29814)
- removes a bunch of defensive copying of Arrow data, which was a workaround for the aforementioned Arrow serialization bug
- adds a CI job for Arrow 7
- bumps the pyarrow upper bound to 8.0.0
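For illustration, one common way to neutralize this kind of slice-view bloat is to round-trip tables through Arrow IPC, which writes out only the rows a slice actually references. A hedged sketch using Ray's public custom-serializer hook follows; the serializer the PR actually registers may differ:

```python
import pyarrow as pa
import ray


def _serialize_table(table: pa.Table) -> bytes:
    # The IPC writer copies out only the referenced rows, truncating the
    # oversized parent buffers behind a zero-copy slice view.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return sink.getvalue().to_pybytes()


def _deserialize_table(data: bytes) -> pa.Table:
    return pa.ipc.open_stream(data).read_all()


# Assumption: Ray's public custom-serializer hook; the reducer Ray
# registers for Arrow data in the PR may differ in detail.
ray.util.register_serializer(
    pa.Table, serializer=_serialize_table, deserializer=_deserialize_table
)
```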
Just a note: this is becoming a topic of high importance in conda-forge. Arrow 10 was recently released, and we cannot reasonably support more than four Arrow versions at the same time (already an exceptionally high number), given all the required migrations that are constantly happening. That means conda-forge will very soon be Arrow 7+ only, and therefore Ray will become progressively harder to install alongside other up-to-date packages, unless and until something like this PR gets merged and released.
Thanks for the conda-forge context @h-vetinari! The good news is that this mono-PR has been decomposed into a stack of PRs, and the PR adding support for Arrow 7 has just been merged: #29993. The next PR in the stack adds support for Arrow 8 through Arrow nightly: #29999. So when conda-forge drops Arrow 6 support, Ray should hopefully be in a position to support any of the more recent major Arrow versions.
…nd nightly. (#29999)

This PR adds support for Arrow 8, 9, 10, and nightly in Ray. It is the third PR in a set of stacked PRs making up this mono-PR for Arrow 7+ support (#29161), and is stacked on top of a PR fixing task cancellation in Ray Core (#29984) and a PR adding support for Arrow 7 (#29993). The last two commits are the relevant commits for review.

Summary of Changes

This PR:
- For Arrow 9+, adds allow_bucket_creation=true to S3 URIs for the Ray Core Storage API and for the Datasets S3 write API ([Datasets] In Arrow 9+, creating S3 buckets requires explicit opt-in; #29815).
- For Arrow 9+, creates an ExtensionScalar subclass for tensor extension types that returns an ndarray view from .as_py() ([Datasets] For Arrow 8+, tensor column element access returns an ExtensionScalar; #29816).
- For Arrow 8.*, manually converts the ExtensionScalar to an ndarray for tensor extension types, since the ExtensionScalar type exists but isn't subclassable in Arrow 8 (#29816).
- For Arrow 10+, matches on other potential error messages when encountering permission issues while interacting with S3 ([Datasets] In Arrow 10+, S3 errors raised due to permission issues can vary beyond our current pattern matching; #29994).
- Adds CI jobs for Arrow 8, 9, 10, and nightly.
- Removes the pyarrow version upper bound.
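A minimal sketch of the Arrow 9+ scalar hook mentioned above; the type and names are a toy example, not Ray's tensor extension type:

```python
import numpy as np
import pyarrow as pa


class NdarrayScalar(pa.ExtensionScalar):
    """Scalar that yields an ndarray view instead of a Python list."""

    def as_py(self):
        # self.value is the storage (list) scalar; .values is the
        # underlying flat Arrow array, viewable as an ndarray.
        return np.asarray(self.value.values)


class NdarrayType(pa.ExtensionType):
    """Toy tensor-like extension type; Ray's real tensor type is richer."""

    def __init__(self):
        super().__init__(pa.list_(pa.float64()), "example.ndarray")

    def __arrow_ext_serialize__(self):
        return b""  # no parameters to serialize in this toy type

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

    def __arrow_ext_scalar_class__(self):
        # Arrow 9+ hook: which ExtensionScalar subclass to use for
        # elements of this type (the mechanism the PR relies on).
        return NdarrayScalar
```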
This reverts commit 081ce2f, and unreverts PR #29055.
This PR adds support for Arrow 9 while preserving support for Arrow 6, 7, 8; we duplicate the Datasets CI job for each of these Arrow major versions to ensure this support.
Summary of major changes

- pyarrow >= 6.0.1 is now supported with no upper bound, except for Python 3.6 (Arrow dropped Python 3.6 support in Arrow 7) and Windows (Arrow has some blocker bugs in Arrow 7+) (link).
- For Arrow 9+, we add allow_bucket_creation=true to S3 URIs for the Ray Core Storage API and for the Datasets S3 write API (link 1, link 2).
- For Arrow 9+, we create an ExtensionScalar subclass for tensor extension types that returns an ndarray view from .as_py() (link).
- For Arrow 8.*, we manually convert the ExtensionScalar to an ndarray for tensor extension types, since the ExtensionScalar type exists but isn't subclassable in Arrow 8 (link 1, link 2).
- ~~We override table concatenation and tensor extension array concatenation to ensure that concatenating tensor column chunks that each have a homogeneous shape but cumulatively have different shapes results in a variable-shaped tensor column (link 1, link 2, link 3, link 4).~~ Pulled out into a separate PR.
- ds.map_batches() now allows the user to opt in to zero-copy batching via allow_mutate_batch=False (link).

TODOs

- Confirm that the pg_long_running_performance_test release test passes.
- Consolidate Arrow compatibility utilities into ray._private.arrow_compat to share between the Storage API, Ray AIR, and Ray Datasets.

Related Issue Number

Closes #29814, closes #29816, closes #29815, closes #29817, closes #29822
Checks

- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.