
Revert "[Data] Change offsets to int64 and change to LargeList for ArrowTensorArray" #46511

Merged

Conversation

can-anyscale
Collaborator

Reverts #45352

Breaking #46499, #46496 and #46495

@scottjlee
Contributor

@terraflops1048576 I am looking into the release test failure; let me see if there is a quick fix available, otherwise I will try to find a reproducible example.

@can-anyscale
Collaborator Author

@scottjlee w00t, I already reverted it; feel free to merge it again once it's fixed.

@scottjlee
Contributor

@terraflops1048576 here is a minimal reproducible example which fails:

import ray
data_url = "s3://anonymous@air-example-data-2/10G-image-data-synthetic-raw-parquet/8cc8856e16c343829ef320fef4b353b1_000000.parquet"
ds = ray.data.read_parquet(data_url)

with this traceback:

Traceback (most recent call last):
  File "/home/ray/default/minimal.py", line 5, in <module>
    ds = ray.data.read_parquet(data_url)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/read_api.py", line 766, in read_parquet
    datasource = ParquetDatasource(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/datasource/parquet_datasource.py", line 229, in __init__
    pq_ds = pq.ParquetDataset(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 1776, in __new__
    return _ParquetDatasetV2(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 2479, in __init__
    [fragment], schema=schema or fragment.physical_schema,
  File "pyarrow/_dataset.pyx", line 1345, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/types.pxi", line 1661, in pyarrow.lib.PyExtensionType.__arrow_ext_deserialize__
TypeError: Expected storage type large_list<item: uint8> but got list<item: uint8>

I took an initial look, but wasn't able to find anything conclusive. Would you be able to take a look, and see if you can debug the issue? Thanks!

@terraflops1048576
Contributor

terraflops1048576 commented Jul 10, 2024

@scottjlee The cause of this problem is that the example data uses the old tensor storage format, which stores values as lists (32-bit offset arrays), while the PR I introduced uses the new tensor storage format, which uses large lists (64-bit offset arrays). Loading data written in the old format is therefore backward-incompatible without some careful work to fix it (or refreshing the example data to use the new format).

I think this is possibly solvable with some conversion code, but I'll have to give it more thought.

@terraflops1048576
Contributor

It may well be that we have to define a separate extension type, ArrowLargeTensorArray, for backward compatibility and convert between the two, but that seems like it would pollute the codebase.

@scottjlee
Contributor

Good find. I think, for the best user experience, Ray Data should be able to handle both the old and new tensor storage formats without requiring the user to configure anything (so some form of auto-retry/fallback logic).

I think either
(a) defining a new extension type like ArrowLargeTensorArray, attempting to use it first, then falling back to ArrowTensorArray, or
(b) adding some additional conversion/backwards compatibility logic

would be two viable options. What do you think?
