[Datasets] Correct schema unification for Datasets with ragged Arrow arrays #31076

Merged: 21 commits, Jan 4, 2023

@scottjlee scottjlee commented Dec 13, 2022

Signed-off-by: Scott Lee [email protected]

Why are these changes needed?

When creating Datasets with ragged arrays, the resulting Dataset incorrectly uses ArrowTensorArray instead of ArrowVariableShapedTensorArray as the underlying schema type. This PR refactors existing logic for schema unification into a separate function, which is now called during Arrow table concatenation and schema fetching to correct type promotion involving ragged arrays.
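The promotion rule described above can be sketched in plain Python. This is an illustrative model only, not the code from this PR: the `Shape` alias and the `unify_tensor_column` and `unify_tensor_schemas` helpers are hypothetical names, and a column's type is reduced to either a fixed element shape or `None` for variable-shaped (ragged).

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical stand-in for a tensor column's type: a fixed element shape,
# or None when the column is variable-shaped (ragged).
Shape = Optional[Tuple[int, ...]]


def unify_tensor_column(block_shapes: Sequence[Shape]) -> Shape:
    """Return the shared fixed shape, or None if the column must be ragged."""
    if not block_shapes:
        raise ValueError("need at least one block")
    first = block_shapes[0]
    # Any variable-shaped block, or any disagreement between blocks,
    # forces promotion to the variable-shaped type.
    if any(shape is None or shape != first for shape in block_shapes):
        return None
    return first


def unify_tensor_schemas(schemas: Sequence[Dict[str, Shape]]) -> Dict[str, Shape]:
    """Unify per-block schemas column by column (same column names assumed)."""
    return {
        name: unify_tensor_column([schema[name] for schema in schemas])
        for name in schemas[0]
    }
```

For example, two blocks whose `image` column has element shapes `(2, 2)` and `(3, 2)` unify to `None` (variable-shaped), while a column with shape `(4,)` in both blocks stays fixed-shape.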

Related issue number

Closes #30082

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Scott Lee <[email protected]>
Resolved review threads on:
python/ray/air/util/tensor_extensions/arrow.py
python/ray/data/_internal/arrow_ops/transform_pyarrow.py
python/ray/data/_internal/plan.py

@clarkzinzow clarkzinzow left a comment

LGTM!

python/ray/data/_internal/arrow_ops/transform_pyarrow.py
    else:
        schemas_to_unify = schemas
    # Let Arrow unify the schema of non-tensor extension type columns.
    return pyarrow.unify_schemas(schemas_to_unify)

This can be a future PR (I can do it as part of the type promotion PR), but we might want to try-except this pyarrow.unify_schemas() call, since this is the point at which we're validating that all of the schemas from different blocks are compatible. Propagating any exception raised from pyarrow.unify_schemas() seems fine for now, and in the future we can look at wrapping any raised exceptions with our own error indicating that the blocks have incompatible schemas and giving the user a path to rectifying this (e.g. manually specifying a schema at read time, so all blocks are consistent).

@clarkzinzow clarkzinzow merged commit 17b2235 into ray-project:master Jan 4, 2023
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
…ys (#31076)
Signed-off-by: Scott Lee <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
…ys (ray-project#31076)
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: tmynn <[email protected]>
Development

Successfully merging this pull request may close these issues.

[Datasets] Ragged Arrow datasets have incorrect schema
3 participants