[Datasets] Ragged Arrow datasets have incorrect schema #30082
Comments
This is because the default parallelism of
We could do some schema promotion/unification if/when we have metadata for all blocks on hand. We already have this schema unification logic in the table concatenation path; all we'd need to do is factor it out of table concatenation and use it when fetching the schema.
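The unification idea in the comment above can be illustrated with a minimal, library-free sketch. It does not use Ray's or Arrow's actual APIs: `BlockShape`, `first_block_type`, and `unified_type` are hypothetical names, and the sketch assumes each block, taken alone, reports a fixed tensor shape for the column. Under that assumption, reading the type from a single block reports a fixed-shape tensor column, while combining the metadata of all blocks reveals that the column is ragged.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical stand-in for one block's metadata: the per-element tensor shape
# of a single tensor column, or None if the block's metadata is not available.
# This is NOT how Ray or Arrow represent schemas; it only illustrates the idea.
BlockShape = Tuple[int, ...]


def first_block_type(block_shapes: Sequence[Optional[BlockShape]]) -> str:
    """Report the column type using only the first block with known metadata."""
    for shape in block_shapes:
        if shape is not None:
            return f"fixed{shape}"
    raise ValueError("no block metadata available")


def unified_type(block_shapes: Sequence[Optional[BlockShape]]) -> str:
    """Report the column type after combining metadata from every known block."""
    known = {shape for shape in block_shapes if shape is not None}
    if not known:
        raise ValueError("no block metadata available")
    if len(known) == 1:
        # Every block agrees on the element shape, so a fixed-shape type is safe.
        return f"fixed{next(iter(known))}"
    # Blocks disagree on the element shape, so promote to a variable-shaped type.
    return "variable"


shapes = [(2, 2), (3, 3)]
print(first_block_type(shapes))  # looks fixed-shape when only one block is read
print(unified_type(shapes))      # promoted to variable-shaped across all blocks
```

In this sketch the promotion only happens when metadata for more than one block is on hand, which mirrors the "if/when we have metadata for all blocks" condition in the comment.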
Looks like this can be an issue for
To confirm my understanding, is it the case that whenever
Correct, that will always be the case with the current schema fetching.
@bveeramani Can you triage the issue as well? (by removing the triage label and adding a priority label) Thanks!
@xwjiang2010 done!
Copying relevant offline discussion:
Bumping to P1 because this breaks prediction with ragged tensors.
What happened + What you expected to happen
I created a dataset from ragged arrays. I expected the schema type to be ArrowVariableShapedTensorArray, but I got ArrowTensorArray instead.
Versions / Dependencies
Ray: 5aa1bb0
Python: 3.7.15
OS: Linux
Arrow: 6.0.1
Reproduction script
Issue Severity
Medium: It is a significant difficulty but I can work around it.