[Datasets] Ragged Arrow datasets have incorrect schema #30082

bveeramani · 2022-11-08T01:21:48Z

What happened + What you expected to happen

I created a dataset from ragged arrays. I expected the schema type to be ArrowVariableShapedTensorType, but I got ArrowTensorType instead.

Versions / Dependencies

Ray: 5aa1bb0
Python: 3.7.15
OS: Linux
Arrow: 6.0.1

Reproduction script

>>> import ray
>>> import numpy as np
>>> ds = ray.data.from_items([{"spam": np.zeros((32, 32, 3))}, {"spam": np.zeros((64, 64, 3))}])
>>> ds.schema().types
[ArrowTensorType(shape=(32, 32, 3), dtype=double)]

Issue Severity

Medium: It is a significant difficulty but I can work around it.


clarkzinzow · 2022-11-08T03:43:16Z

This is because the default parallelism of ray.data.from_items() will give a block per row, and we perform type inference via only the first block. Since the blocks are trivially homogeneous-shaped (they each contain a single tensor element), the type comes back as ArrowTensorType.

We could do some schema promotion/unification if/when we have metadata for all blocks on hand. We already have this schema unification logic in the table concatenation path; all we'd need to do is factor it out of table concatenation and use it when fetching the schema.
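A minimal sketch of the failure mode described above, in plain Python with no Ray or Arrow dependency. The names `infer_block_type`, `schema_from_first_block`, and `schema_from_all_blocks` are hypothetical stand-ins for illustration, not Ray APIs, and the string results only mimic the fixed-shape versus variable-shaped distinction.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


def infer_block_type(shapes: List[Shape]) -> str:
    """Return a stand-in type name for a collection of tensor shapes.

    A collection whose tensors all share one shape looks fixed-shape;
    anything else needs a variable-shaped type.
    """
    if len(set(shapes)) == 1:
        return f"fixed{shapes[0]}"
    return "variable"


def schema_from_first_block(blocks: List[List[Shape]]) -> str:
    # Models the behavior described in the thread: only the first block
    # is consulted when the schema is fetched.
    return infer_block_type(blocks[0])


def schema_from_all_blocks(blocks: List[List[Shape]]) -> str:
    # Models the proposed fix: consider the shapes seen in every block.
    all_shapes = [shape for block in blocks for shape in block]
    return infer_block_type(all_shapes)


# One row per block, as with the default parallelism of from_items().
blocks = [[(32, 32, 3)], [(64, 64, 3)]]

first_only = schema_from_first_block(blocks)
unified = schema_from_all_blocks(blocks)
print(first_only)
print(unified)
```

With one row per block, every block is trivially homogeneous, so the first-block result never reflects the raggedness that only shows up across blocks.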

bveeramani · 2022-11-08T18:40:32Z

Looks like this can be an issue for read_images, too:

>>> ds = ray.data.read_images("python/ray/data/examples/data/image-datasets/different-sizes")
>>> ds
Dataset(num_blocks=3, num_rows=3, schema={image: ArrowTensorType(shape=(16, 16, 3), dtype=uint8)})

> Since the blocks are trivially homogeneous-shaped (they each contain a single tensor element), the type comes back as ArrowTensorType.

To confirm my understanding, is it the case that whenever num_blocks = num_rows, ragged schemas won't be inferred?

clarkzinzow · 2022-11-08T18:51:04Z

> To confirm my understanding, is it the case that whenever num_blocks = num_rows, ragged schemas won't be inferred?

Correct, that will always be the case with the current schema fetching.

xwjiang2010 · 2022-11-08T21:05:43Z

@bveeramani Can you triage the issue as well? (by removing the triage label and adding a priority label) Thanks!

bveeramani · 2022-11-08T22:04:47Z

@xwjiang2010 done!

bveeramani · 2022-12-06T17:33:05Z

Copying relevant offline discussion:

[To implement this you] should just need to create a unify_schemas(schemas: List[pa.Schema]) -> pa.Schema utility that’s used by both table concatenation and when fetching schema, and then some minor refactoring of how the ArrowTensorType --> ArrowVariableShapedTensorType promotion is implemented. Right now, we chunk the columns for all tables and see if the type was promoted to the variable-shaped tensor type (link), but I think we’d want to change this to an explicit ArrowTensorType._need_variable_shaped_tensor_array(types) check, where we refactor the existing method to take a list of tensor types rather than a list of tensor arrays.
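A rough sketch of the shape such a utility could take, written against plain Python stand-ins rather than pyarrow or Ray types. `FixedTensorType`, `VariableTensorType`, and `need_variable_shaped` are hypothetical names for illustration; the sketch also assumes every schema has the same columns and that a column's dtype is the same in every block, which the real implementation would have to check.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class FixedTensorType:
    """Stand-in for a fixed-shape tensor column type."""
    shape: Tuple[int, ...]
    dtype: str


@dataclass(frozen=True)
class VariableTensorType:
    """Stand-in for a variable-shaped tensor column type."""
    ndim: int
    dtype: str


ColumnType = Union[FixedTensorType, VariableTensorType]
Schema = Dict[str, FixedTensorType]


def need_variable_shaped(types: List[FixedTensorType]) -> bool:
    # Operates on types rather than arrays, so only block metadata is needed.
    return len({t.shape for t in types}) > 1


def unify_schemas(schemas: List[Schema]) -> Dict[str, ColumnType]:
    """Merge per-block schemas, promoting columns whose shapes disagree."""
    unified: Dict[str, ColumnType] = {}
    for name in schemas[0]:
        col_types = [schema[name] for schema in schemas]
        first = col_types[0]
        if need_variable_shaped(col_types):
            unified[name] = VariableTensorType(len(first.shape), first.dtype)
        else:
            unified[name] = first
    return unified


block_schemas = [
    {"spam": FixedTensorType((32, 32, 3), "double")},
    {"spam": FixedTensorType((64, 64, 3), "double")},
]
result = unify_schemas(block_schemas)
print(result)
```

Because the check only looks at type metadata, the same function could in principle serve both the table concatenation path and schema fetching without materializing any block data.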

bveeramani · 2022-12-10T02:52:03Z

Bumping to P1 because this breaks prediction with ragged tensors

bveeramani added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) data Ray Data-related issues labels Nov 8, 2022

bveeramani changed the title ~~[Datasets] Ragged Arrow datasets has incorrect schema~~ [Datasets] Ragged Arrow datasets have incorrect schema Nov 8, 2022

bveeramani self-assigned this Nov 8, 2022

bveeramani added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 8, 2022

clarkzinzow added the air label Nov 18, 2022

bveeramani removed their assignment Dec 6, 2022

c21 assigned scottjlee Dec 6, 2022

bveeramani added P1 Issue that should be fixed within a few weeks and removed P2 Important issue, but not time-critical labels Dec 10, 2022

scottjlee mentioned this issue Dec 13, 2022

[Datasets] Correct schema unification for Datasets with ragged Arrow arrays #31076

Merged

7 tasks

clarkzinzow closed this as completed in #31076 Jan 4, 2023
