[Datasets] Arrow data buffers aren't truncated when pickling zero-copy slice views, leading to huge serialization bloat #29814
Comments
clarkzinzow added the labels **bug** (Something that is supposed to be working, but isn't), **P1** (Issue that should be fixed within a few weeks), **air**, and **data** (Ray Data-related issues) on Oct 28, 2022
Is it possible to fix the pickling path upstream and truncate the buffers there?
@pcmoritz Yes, I'm planning on doing that for Arrow 11: https://issues.apache.org/jira/browse/ARROW-10739
This was referenced Nov 3, 2022
clarkzinzow added a commit that referenced this issue on Nov 8, 2022:

… Arrow serialization bug. (#29993) This PR adds support for Arrow 7 in Ray, and is the second PR in a set of stacked PRs making up this mono-PR for Arrow 7+ support: #29161, and is stacked on top of a PR fixing task cancellation in Ray Core: #29984. This PR:
- fixes a serialization bug in Arrow with a custom serializer for Arrow data ([Datasets] Arrow data buffers aren't truncated when pickling zero-copy slice views, leading to huge serialization bloat #29814)
- removes a bunch of defensive copying of Arrow data, which was a workaround for the aforementioned Arrow serialization bug
- adds a CI job for Arrow 7
- bumps the pyarrow upper bound to 8.0.0
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue on Dec 19, 2022 (same change as #29993, signed off by Weichen Xu)
This issue duplicates #22310 so that we have a clean issue on which to summarize the problem.
When pickling Arrow data (tables, in our case), Arrow currently doesn't truncate the data buffers for zero-copy slice views of the data (see the Arrow ticket). For many of our workloads, this results in sending a potentially very large data buffer over the network for a very small slice. For pathological cases like shuffling, where we might chunk a data block into 1000 chunks and send each to a different reduce task, this would involve sending 1000 copies of the entire data block; for a 1 GiB data block, we'd send an aggregate 1 TiB of data over the network instead of the expected 1 GiB. This makes Ray + Arrow essentially unusable.
This bug exists in Arrow 6 through Arrow 10; we worked around it in Arrow 6 by explicitly copying when slicing data, e.g. in our shuffle implementation, but Arrow 7+ adds internal zero-copy slicing (e.g. when reading Parquet) that makes it untenable to cover every slicing path with this workaround.
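The failure mode can be sketched with a pure-Python stand-in (the `SliceView` class here is hypothetical, not an Arrow API): a zero-copy view stores an offset and length into a shared buffer, so naive pickling serializes the entire backing buffer no matter how small the view is.

```python
import pickle


class SliceView:
    """Zero-copy view over a shared byte buffer, analogous to an Arrow
    zero-copy slice: it stores an offset/length, not a copy of the data."""

    def __init__(self, buffer: bytes, offset: int, length: int):
        self.buffer = buffer  # shared, possibly huge
        self.offset = offset
        self.length = length

    def data(self) -> bytes:
        return self.buffer[self.offset:self.offset + self.length]


big = bytes(1_000_000)        # ~1 MB backing buffer
view = SliceView(big, 0, 10)  # logically only 10 bytes

# Default pickling carries the entire backing buffer along with the view,
# so the payload is ~1 MB for a 10-byte slice.
assert len(pickle.dumps(view)) > 1_000_000
```

Scaling this up to a shuffle that pickles 1000 such views of the same block reproduces the aggregate 1000x bloat described above.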
Solution
We can bypass this buggy pickle path by registering a custom serializer for Arrow data that properly truncates these buffers.
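As a minimal illustration of the technique (again using a hypothetical `SliceView` stand-in rather than Arrow's actual types, and stdlib `copyreg` rather than Ray's serializer-registration machinery), a custom reducer can copy out only the referenced bytes before pickling:

```python
import copyreg
import pickle


class SliceView:
    """Hypothetical zero-copy view over a shared byte buffer."""

    def __init__(self, buffer: bytes, offset: int, length: int):
        self.buffer = buffer
        self.offset = offset
        self.length = length


def _truncating_reducer(view: SliceView):
    # Truncate the backing buffer to just the viewed range, so the
    # serialized payload scales with the slice, not the full buffer.
    truncated = view.buffer[view.offset:view.offset + view.length]
    return SliceView, (truncated, 0, view.length)


# Register the custom serializer; pickle consults copyreg's dispatch table.
copyreg.pickle(SliceView, _truncating_reducer)

big = bytes(1_000_000)
view = SliceView(big, 0, 10)

# The pickled payload now carries only the 10 referenced bytes.
assert len(pickle.dumps(view)) < 1_000
```

The real fix registers an analogous custom serializer for Arrow tables and arrays that truncates each buffer to the slice's offset/length before serialization, which is what #29993 implements.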