
Revert "[Data] Change offsets to int64 and change to LargeList for ArrowTensorArray" #46511

Merged

Conversation

can-anyscale
Collaborator

Reverts #45352

Breaking #46499, #46496 and #46495

@scottjlee
Contributor

@terraflops1048576 I am looking into the release test failure; let me see if there is a quick fix available, otherwise I will try to find a reproducible example.

@can-anyscale
Collaborator Author

@scottjlee w00t, I already reverted it; feel free to merge it again once it's fixed.

@scottjlee
Contributor

@terraflops1048576 here is a minimal reproducible example which fails:

import ray
data_url = "s3://anonymous@air-example-data-2/10G-image-data-synthetic-raw-parquet/8cc8856e16c343829ef320fef4b353b1_000000.parquet"
ds = ray.data.read_parquet(data_url)

with this traceback:

Traceback (most recent call last):
  File "/home/ray/default/minimal.py", line 5, in <module>
    ds = ray.data.read_parquet(data_url)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/read_api.py", line 766, in read_parquet
    datasource = ParquetDatasource(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/datasource/parquet_datasource.py", line 229, in __init__
    pq_ds = pq.ParquetDataset(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 1776, in __new__
    return _ParquetDatasetV2(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 2479, in __init__
    [fragment], schema=schema or fragment.physical_schema,
  File "pyarrow/_dataset.pyx", line 1345, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/types.pxi", line 1661, in pyarrow.lib.PyExtensionType.__arrow_ext_deserialize__
TypeError: Expected storage type large_list<item: uint8> but got list<item: uint8>

I took an initial look, but wasn't able to find anything conclusive. Would you be able to take a look, and see if you can debug the issue? Thanks!

@terraflops1048576
Contributor

terraflops1048576 commented Jul 10, 2024

@scottjlee The cause of this problem is that the example data uses the old tensor storage format, which stores values as lists (32-bit offset arrays), while the PR I introduced uses the new tensor storage format, which uses large lists (64-bit offset arrays). Loading data written in the old format is therefore backward-incompatible without some careful work to fix it (or refreshing the example data to use the new format).

I think this is possibly solvable with some conversion code, but I'll have to give it more thought.

@terraflops1048576
Contributor

It may well be that we have to define a separate extension type, ArrowLargeTensorArray, for backward compatibility and convert between the two, but that seems like it would pollute the codebase.

@scottjlee
Contributor

Good find. I think, for the best user experience, Ray Data should be able to handle both the old and new tensor storage formats without requiring the user to configure anything (so some form of auto-retry/fallback logic).

I think either
(a) defining a new extension type like ArrowLargeTensorArray, attempting to use it first, then falling back to ArrowTensorArray, or
(b) adding some additional conversion/backwards compatibility logic

would be two viable options. What do you think?
