[Datasets] Ragged Arrow datasets have incorrect schema #30082

Closed
bveeramani opened this issue Nov 8, 2022 · 7 comments · Fixed by #31076
Labels
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
P1: Issue that should be fixed within a few weeks

Comments

@bveeramani
Member

bveeramani commented Nov 8, 2022

What happened + What you expected to happen

I created a dataset from ragged arrays (arrays whose shapes differ from row to row). I expected the schema type to be ArrowVariableShapedTensorType, but I got ArrowTensorType instead.

Versions / Dependencies

Ray: 5aa1bb0
Python: 3.7.15
OS: Linux
Arrow: 6.0.1

Reproduction script

>>> import ray
>>> import numpy as np
>>> ds = ray.data.from_items([{"spam": np.zeros((32, 32, 3))}, {"spam": np.zeros((64, 64, 3))}])
>>> ds.schema().types
[ArrowTensorType(shape=(32, 32, 3), dtype=double)]

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@bveeramani bveeramani added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) data Ray Data-related issues labels Nov 8, 2022
@bveeramani bveeramani changed the title [Datasets] Ragged Arrow datasets has incorrect schema [Datasets] Ragged Arrow datasets have incorrect schema Nov 8, 2022
@clarkzinzow
Contributor

clarkzinzow commented Nov 8, 2022

This is because the default parallelism of ray.data.from_items() will give a block per row, and we perform type inference via only the first block. Since the blocks are trivially homogeneous-shaped (they each contain a single tensor element), the type comes back as ArrowTensorType.
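The behavior described above can be illustrated with a minimal pure-Python sketch. It does not call Ray; `infer_block_type` and `schema_from_first_block` are hypothetical stand-ins for the real inference code, and the returned strings only mimic the type names mentioned in this thread.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


def infer_block_type(block: List[Shape]) -> str:
    """Hypothetical stand-in: name the tensor type for one block of shapes."""
    if len(set(block)) == 1:
        # Every tensor in the block has the same shape.
        return f"ArrowTensorType(shape={block[0]})"
    return "ArrowVariableShapedTensorType"


def schema_from_first_block(blocks: List[List[Shape]]) -> str:
    """Assumption from the comment above: only the first block is inspected."""
    return infer_block_type(blocks[0])


rows = [(32, 32, 3), (64, 64, 3)]
blocks = [[shape] for shape in rows]  # one block per row

print(schema_from_first_block(blocks))  # fixed-shape type from block 0 only
print(infer_block_type(rows))           # what inspecting all rows would give
```

Because each single-row block is trivially homogeneous, the first block alone never reveals that the dataset as a whole is ragged.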

We could do some schema promotion/unification if/when we have metadata for all blocks on hand. We already have this schema unification logic in the table concatenation path; all we'd need to do is factor it out of table concatenation and use it when fetching the schema.

@bveeramani
Member Author

Looks like this can be an issue for read_images, too:

>>> ds = ray.data.read_images("python/ray/data/examples/data/image-datasets/different-sizes")
>>> ds
Dataset(num_blocks=3, num_rows=3, schema={image: ArrowTensorType(shape=(16, 16, 3), dtype=uint8)})

> Since the blocks are trivially homogeneous-shaped (they each contain a single tensor element), the type comes back as ArrowTensorType.

To confirm my understanding, is it the case that whenever num_blocks = num_rows, ragged schemas won't be inferred?

@clarkzinzow
Contributor

> To confirm my understanding, is it the case that whenever num_blocks = num_rows, ragged schemas won't be inferred?

Correct, that will always be the case with the current schema fetching.
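As a rough illustration of how block layout affects the result, the sketch below splits rows into contiguous blocks and checks whether the first block alone looks ragged. It is a pure-Python model, not Ray code: `split_into_blocks` and `first_block_is_ragged` are hypothetical helpers, and the even contiguous split is an assumption about how rows land in blocks.

```python
from typing import List, Sequence, Tuple

Shape = Tuple[int, ...]


def split_into_blocks(rows: Sequence[Shape], num_blocks: int) -> List[List[Shape]]:
    """Hypothetical helper: split rows into num_blocks contiguous blocks."""
    size, extra = divmod(len(rows), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        end = start + size + (1 if i < extra else 0)
        blocks.append(list(rows[start:end]))
        start = end
    return blocks


def first_block_is_ragged(blocks: List[List[Shape]]) -> bool:
    """Model of first-block-only inference: look at block 0 and nothing else."""
    return len(set(blocks[0])) > 1


rows = [(16, 16, 3), (32, 32, 3), (64, 64, 3)]
for n in (1, 2, 3):
    print(n, first_block_is_ragged(split_into_blocks(rows, n)))

# A first block that happens to be homogeneous also hides raggedness.
mixed = [(16, 16, 3), (16, 16, 3), (64, 64, 3)]
print(first_block_is_ragged(split_into_blocks(mixed, 2)))
```

Under this model, one block per row never reports a ragged type, and the same can happen with fewer blocks whenever the first block contains only same-shaped tensors.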

@bveeramani bveeramani self-assigned this Nov 8, 2022
@xwjiang2010
Contributor

@bveeramani Can you triage the issue as well? (by removing the triage label and adding a priority label) Thanks!

@bveeramani bveeramani added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 8, 2022
@bveeramani
Member Author

@xwjiang2010 done!

@bveeramani
Member Author

Copying relevant offline discussion:

[To implement this you] should just need to create a unify_schemas(schemas: List[pa.Schema]) -> pa.Schema utility that’s used by both table concatenation and when fetching schema, and then some minor refactoring of how the ArrowTensorType --> ArrowVariableShapedTensorType promotion is implemented. Right now, we chunk the columns for all tables and see if the type was promoted to the variable-shaped tensor type (link), but I think we’d want to change this to an explicit ArrowTensorType._need_variable_shaped_tensor_array(types) check, where we refactor the existing method to take a list of tensor types rather than a list of tensor arrays.
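The proposal above could look roughly like the following sketch. It is not Ray's implementation: `FixedTensorType` and `VariableTensorType` are hypothetical stand-ins for ArrowTensorType and ArrowVariableShapedTensorType, schemas are modeled as plain dicts rather than `pa.Schema`, and all schemas are assumed to share the same column names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class FixedTensorType:
    """Hypothetical stand-in for ArrowTensorType."""
    shape: Tuple[int, ...]
    dtype: str


@dataclass(frozen=True)
class VariableTensorType:
    """Hypothetical stand-in for ArrowVariableShapedTensorType."""
    dtype: str
    ndim: int


TensorType = Union[FixedTensorType, VariableTensorType]
Schema = Dict[str, TensorType]


def need_variable_shaped(types: List[TensorType]) -> bool:
    """Check on a list of types (not arrays): is promotion required?"""
    if any(isinstance(t, VariableTensorType) for t in types):
        return True
    return len({t.shape for t in types}) > 1


def unify_schemas(schemas: List[Schema]) -> Schema:
    """Merge per-block schemas, promoting ragged columns to variable-shaped."""
    if not schemas:
        raise ValueError("need at least one schema")
    unified: Schema = {}
    for name in schemas[0]:
        types = [schema[name] for schema in schemas]
        if len({t.dtype for t in types}) != 1:
            raise TypeError(f"column {name!r} has mixed dtypes")
        first = types[0]
        if need_variable_shaped(types):
            ndim = first.ndim if isinstance(first, VariableTensorType) else len(first.shape)
            unified[name] = VariableTensorType(dtype=first.dtype, ndim=ndim)
        else:
            unified[name] = first
    return unified


block_schemas = [
    {"spam": FixedTensorType((32, 32, 3), "double")},
    {"spam": FixedTensorType((64, 64, 3), "double")},
]
print(unify_schemas(block_schemas))
```

Working on types rather than arrays means the check only needs block metadata, so the same helper could serve both table concatenation and schema fetching without materializing any data.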

@bveeramani bveeramani removed their assignment Dec 6, 2022
@bveeramani bveeramani added P1 Issue that should be fixed within a few weeks and removed P2 Important issue, but not time-critical labels Dec 10, 2022
@bveeramani
Member Author

Bumping to P1 because this breaks prediction with ragged tensors.


4 participants