[Datasets] [Arrow 7+ Support - 2/N] Add support for Arrow 7 by fixing Arrow serialization bug. #29993
Conversation
Can we make the PR title more specific about this fix? Or this could be done in two steps: 1) fix the serialization bug; 2) enable Arrow 7?
This Arrow serialization bug was the core blocker for the Arrow 7 upgrade, and in addition to the unit tests for our serialization workaround, I think the best validation that the serialization bug is fixed is the Arrow 7 CI job running the Datasets/AIR test suite. Since the success criteria for the serialization bug fixes are strongly tied to whether Arrow 7 works, I think keeping them in the same PR makes sense?
Force-pushed from 41f77b6 to 66ad41c
Force-pushed from 8337b79 to abc0bb0
    # slices.
    sliced = sliced[0:1]
    return sliced
return super().__getitem__(key)
The Arrow extension array contract is generally to return another extension array for a slice, and the element type (in our case an ndarray) when getting a single row.
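For context, a minimal illustration of that contract, using a plain pyarrow array as a stand-in for the tensor extension array (the tensor case would return an ndarray rather than a pyarrow scalar; this is not Ray's actual implementation):

```python
import pyarrow as pa

arr = pa.array([1, 2, 3])

# Slicing yields another (zero-copy) array of the same type.
assert isinstance(arr[0:1], pa.Array)
# Single-row access yields the element type (a scalar here; an ndarray
# for the tensor extension array discussed above).
assert isinstance(arr[0], pa.Scalar)
```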
@@ -434,7 +434,9 @@ def _sample_piece(
     # Only sample the first row group.
     piece = piece.subset(row_group_ids=[0])
-    batch_size = min(piece.metadata.num_rows, PARQUET_ENCODING_RATIO_ESTIMATE_NUM_ROWS)
+    batch_size = max(
+        min(piece.metadata.num_rows, PARQUET_ENCODING_RATIO_ESTIMATE_NUM_ROWS), 1
Arrow doesn't allow a batch size of 0; this is needed to handle empty Parquet files.
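A tiny sketch of why the clamp matters; the 10 below stands in for the sampling constant and is not its actual value:

```python
# An empty Parquet piece reports zero rows; clamping keeps the batch size
# Arrow-legal (>= 1) while still capping it for non-empty pieces.
num_rows = 0  # e.g., an empty Parquet file
batch_size = max(min(num_rows, 10), 1)
assert batch_size == 1
```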
@clarkzinzow - thanks for the fix! I am thinking that for reading Parquet, we should filter out all empty files before the read. WDYT? I can do it as a follow-up.
Yep, filtering out empty files when sampling is a good idea! It should be easy enough to do before constructing the sample space: https://github.com/ray-project/ray/blob/abc0bb0a625e01aee9f2ccd777248892db2a5f6c/python/ray/data/datasource/parquet_datasource.py#L303-L308
Agreed on doing it as a follow-up.
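A rough sketch of that follow-up, reusing the same metadata accessor as in the diff above (the function and variable names are illustrative, not the datasource's actual ones):

```python
def filter_empty_pieces(pieces):
    # Drop empty pieces before building the sample space, so sampling
    # never sees a piece with zero rows.
    return [piece for piece in pieces if piece.metadata.num_rows > 0]
```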
LGTM with some minor questions/comments.
@@ -59,6 +59,7 @@ def build(self) -> Block:
     if self._builder is None:
         if self._empty_block is not None:
             self._builder = BlockAccessor.for_block(self._empty_block).builder()
+            self._builder.add_block(self._empty_block)
Wondering why this is required now? Did we uncover some bugs from the tests?
Yes, I believe that this was also for the empty Parquet file case, and should be covered by the added test!
schema = self._row.schema
if isinstance(
    schema.field(schema.get_field_index(key)).type,
nit: can it just be schema.field(key).type? See https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.field
Hmm, I had thought that this support was added in a post-Arrow 6 version, but it looks like it's already supported. I'll make this change!
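A quick check of the suggested simplification (pyarrow's Schema.field accepts a field name directly):

```python
import pyarrow as pa

schema = pa.schema([("a", pa.int64()), ("b", pa.string())])
# Both forms resolve to the same field, so the get_field_index() hop
# can be dropped.
assert schema.field("a").type == schema.field(schema.get_field_index("a")).type
```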
""" | ||
import pyarrow as pa | ||
|
||
if os.environ.get(RAY_DISABLE_CUSTOM_ARROW_DATA_SERIALIZATION, "0") == "1": |
Glad that we added an environment variable here!
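For reference, how a user could opt out of the custom serializer, assuming the constant's value is an environment variable of the same name (that mapping is my assumption, not confirmed by this diff), set before Ray starts:

```python
import os

# Assumed env var name; check the constant's definition for the real value.
os.environ["RAY_DISABLE_CUSTOM_ARROW_DATA_SERIALIZATION"] = "1"
```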
Looks good, thanks for making it simpler with Arrow IPC.
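A minimal sketch of the IPC-based idea, not Ray's actual serializer hook: round-tripping a zero-copy slice through an Arrow IPC stream writes only the truncated buffers.

```python
import pyarrow as pa


def serialize_table(table: pa.Table) -> bytes:
    # Write the table through an IPC stream; sliced data is written with
    # truncated buffers rather than dragging the parent buffers along.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return sink.getvalue().to_pybytes()


def deserialize_table(data: bytes) -> pa.Table:
    return pa.ipc.open_stream(data).read_all()


big = pa.table({"x": list(range(1_000_000))})
tiny = big.slice(0, 1)  # zero-copy one-row view of the big table
restored = deserialize_table(serialize_table(tiny))
assert restored.num_rows == 1
```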
Force-pushed from abc0bb0 to bfdbd7e
@ericl @scv119 @jjyao @richardliaw Could I get a rubber stamp for the
Ah sorry @scv119, I thought that you had the magic powers, but it looks like it needs to be one of @ericl, @richardliaw, or @edoakes!
dep spec LGTM
…nd nightly. (#29999) This PR adds support for Arrow 8, 9, 10, and nightly in Ray, and is the third PR in a set of stacked PRs making up this mono-PR for Arrow 7+ support (#29161); it is stacked on top of a PR fixing task cancellation in Ray Core (#29984) and a PR adding support for Arrow 7 (#29993). The last two commits are the relevant commits for review.
Summary of Changes
This PR:
- For Arrow 9+, adds allow_bucket_creation=true to S3 URIs for the Ray Core Storage API and for the Datasets S3 write API ([Datasets] In Arrow 9+, creating S3 buckets requires explicit opt-in. #29815).
- For Arrow 9+, creates an ExtensionScalar subclass for tensor extension types that returns an ndarray view from .as_py() ([Datasets] For Arrow 8+, tensor column element access returns an ExtensionScalar. #29816).
- For Arrow 8.*, manually converts the ExtensionScalar to an ndarray for tensor extension types, since the ExtensionScalar type exists but isn't subclassable in Arrow 8 ([Datasets] For Arrow 8+, tensor column element access returns an ExtensionScalar. #29816).
- For Arrow 10+, matches on other potential error messages when encountering permission issues while interacting with S3 ([Datasets] In Arrow 10+, S3 errors raised due to permission issues can vary beyond our current pattern matching. #29994).
- Adds CI jobs for Arrow 8, 9, 10, and nightly.
- Removes the pyarrow version upper bound.
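A hedged illustration of the Arrow 9+ S3 opt-in mentioned above; the bucket and path are placeholders, and how Arrow parses the query string is not shown here.

```python
# Ray's fix appends the bucket-creation opt-in to the S3 URI it hands to Arrow.
uri = "s3://my-bucket/output"
uri_with_opt_in = uri + "?allow_bucket_creation=true"
```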
… Arrow serialization bug. (ray-project#29993) This PR adds support for Arrow 7 in Ray, and is the second PR in a set of stacked PRs making up this mono-PR for Arrow 7+ support (ray-project#29161); it is stacked on top of a PR fixing task cancellation in Ray Core (ray-project#29984).
This PR:
- Fixes a serialization bug in Arrow with a custom serializer for Arrow data ([Datasets] Arrow data buffers aren't truncated when pickling zero-copy slice views, leading to huge serialization bloat. ray-project#29814).
- Removes a bunch of defensive copying of Arrow data, which was a workaround for the aforementioned Arrow serialization bug.
- Adds a CI job for Arrow 7.
- Bumps the pyarrow upper bound to 8.0.0.
Signed-off-by: Weichen Xu <[email protected]>
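For illustration, the bloat described in ray-project#29814 can be seen with plain pickle on affected Arrow versions; with the custom serializer (or on fixed Arrow versions), the payload is roughly the size of the one-row slice instead.

```python
import pickle

import pyarrow as pa

big = pa.table({"x": list(range(1_000_000))})
tiny = big.slice(0, 1)  # zero-copy one-row view
# On affected Arrow versions this payload carries the full parent buffers,
# so it is orders of magnitude larger than one row's worth of data.
print(len(pickle.dumps(tiny)))
```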
Toward fixing #38300. PR #29993 added a local Ray fix for issue apache/arrow#26685, but at the time Windows failed the tests with pyarrow 7. In issue #38300 the suggested fix was to release the pin. Signed-off-by: mattip <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
This PR adds support for Arrow 7 in Ray. It is the second PR in a set of stacked PRs making up the mono-PR for Arrow 7+ support (#29161), and is stacked on top of a PR fixing task cancellation in Ray Core (#29984).
Summary of Changes
This PR:
- fixes a serialization bug in Arrow with a custom serializer for Arrow data (#29814)
- removes a bunch of defensive copying of Arrow data, which was a workaround for the aforementioned serialization bug
- adds a CI job for Arrow 7
- bumps the pyarrow upper bound to 8.0.0
Related issue number
Closes #29992, closes #29814
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.