[Datasets] Correct schema unification for Datasets with ragged Arrow arrays #31076
Conversation
LGTM!
else:
    schemas_to_unify = schemas
# Let Arrow unify the schema of non-tensor extension type columns.
return pyarrow.unify_schemas(schemas_to_unify)
This can be a future PR (I can do it as part of the type promotion PR), but we might want to try-except this pyarrow.unify_schemas() call, since this is the point at which we're validating that all of the schemas from different blocks are compatible. Propagating any exception raised from pyarrow.unify_schemas() seems fine for now, and in the future we can look at wrapping any raised exceptions with our own error indicating that the blocks have incompatible schemas and giving the user a path to rectifying this (e.g. manually specifying a schema at read time, so all blocks are consistent).
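A future wrapper along those lines might look like the following sketch (the helper name and error message are hypothetical, and it assumes the incompatibility surfaces as pyarrow.ArrowInvalid):

import pyarrow

def _unify_schemas_or_raise(schemas_to_unify):
    # Sketch: wrap schema unification so that incompatible block schemas
    # surface as a Datasets-level error rather than a raw Arrow exception.
    try:
        return pyarrow.unify_schemas(schemas_to_unify)
    except pyarrow.ArrowInvalid as e:
        raise ValueError(
            "Blocks have incompatible schemas and could not be unified. "
            "Consider specifying a schema at read time so all blocks are consistent."
        ) from e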
Why are these changes needed?
When creating Datasets with ragged arrays, the resulting Dataset incorrectly uses ArrowTensorArray instead of ArrowVariableShapedTensorArray as the underlying schema type. This PR refactors existing logic for schema unification into a separate function, which is now called during Arrow table concatenation and schema fetching to correct type promotion involving ragged arrays.
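As a rough illustration of the reported behavior, a minimal sketch (not taken from the PR; it assumes ray.data.from_numpy places the two differently shaped arrays into separate blocks, so schema fetching has to unify their schemas):

import numpy as np
import ray

# Two blocks whose tensor columns have different fixed shapes; unifying their
# schemas should promote the column to the variable-shaped tensor type.
ds = ray.data.from_numpy([np.ones((2, 2, 2)), np.ones((3, 3, 3))])

# Before this fix, the fetched schema could report the fixed-shape tensor type
# (ArrowTensorArray) instead of the variable-shaped one.
print(ds.schema())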
Related issue number

Closes #30082
Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.