[Datasets] Correct schema unification for Datasets with ragged Arrow arrays #31076

Merged

clarkzinzow merged 21 commits into ray-project:master from scottjlee:ragged-arrow-schema

Jan 4, 2023

Contributor

scottjlee commented Dec 13, 2022 (edited)

Signed-off-by: Scott Lee [email protected]

Why are these changes needed?

When a Dataset is created from ragged arrays (tensors whose shapes differ from one another), the resulting schema incorrectly reports the fixed-shape ArrowTensorArray type instead of ArrowVariableShapedTensorArray. This PR refactors the existing schema unification logic into a separate function, which is now called during Arrow table concatenation and schema fetching so that type promotion involving ragged arrays is handled correctly.
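The promotion rule described above can be illustrated without Arrow. The sketch below is not the code from this PR: the function name `promote_tensor_column` and the convention of representing a variable-shaped column as `None` are assumptions made for illustration only. It shows the intended outcome: a tensor column stays fixed-shape only when every block agrees on the element shape, and is otherwise promoted to variable-shaped.

```python
from typing import Optional, Sequence, Tuple

Shape = Tuple[int, ...]


def promote_tensor_column(
    block_shapes: Sequence[Optional[Shape]],
) -> Optional[Shape]:
    """Pick the unified shape of one tensor column across blocks.

    Hypothetical helper. Each entry is the element shape reported by one
    block, or None if that block already stores the column as
    variable-shaped. Returns the common shape, or None when the unified
    column has to be variable-shaped (ragged).
    """
    shapes = list(block_shapes)
    if not shapes:
        raise ValueError("need at least one block")
    # One variable-shaped block forces the whole column to be variable-shaped.
    if any(shape is None for shape in shapes):
        return None
    first = shapes[0]
    # Blocks that disagree on the element shape mean the column is ragged.
    if all(shape == first for shape in shapes):
        return first
    return None


# Two blocks with matching shapes keep the fixed-shape type.
same = promote_tensor_column([(2, 2), (2, 2)])
# Two blocks with different shapes are promoted to variable-shaped.
ragged = promote_tensor_column([(2, 2), (3, 5)])
```

In the PR itself this decision is made on the Arrow extension types of each block's schema rather than on plain tuples.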

Related issue number

Closes #30082

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(


          initial progress

921872d

Signed-off-by: Scott Lee <[email protected]>

scottjlee requested review from ericl, scv119, clarkzinzow, jjyao, jianoaix and c21 as code owners

December 13, 2022 20:03

scottjlee added 4 commits

December 13, 2022 13:04


          more scratch work

d7a8472

Signed-off-by: Scott Lee <[email protected]>


          add scalar_type property in ArrowTensorType to simplify unify_schemas…

e5094ef

…() logic

Signed-off-by: Scott Lee <[email protected]>


          clean up and format

9a962c8

Signed-off-by: Scott Lee <[email protected]>


          add check for same type on simple blocks, lazy block list support

2530d6a

Signed-off-by: Scott Lee <[email protected]>

scottjlee commented

python/ray/data/_internal/arrow_ops/transform_pyarrow.py (resolved)

scottjlee assigned c21

scottjlee added 6 commits

December 14, 2022 12:18


          format

2068d12

Signed-off-by: Scott Lee <[email protected]>


          Merge branch 'master' into ragged-arrow-schema

f4587a8

Signed-off-by: Scott Lee <[email protected]>


          Merge branch 'master' into ragged-arrow-schema

b33f8c9

Signed-off-by: Scott Lee <[email protected]>


          improved typechecking on unify_schemas

ce43367

Signed-off-by: Scott Lee <[email protected]>


          check all blocks for potential pyarrow schema

c5f2561

Signed-off-by: Scott Lee <[email protected]>


          Merge branch 'master' into ragged-arrow-schema

dc36c50

Signed-off-by: Scott Lee <[email protected]>

clarkzinzow reviewed

python/ray/air/util/tensor_extensions/arrow.py (outdated, resolved)

python/ray/data/_internal/arrow_ops/transform_pyarrow.py (outdated, resolved)

python/ray/data/_internal/arrow_ops/transform_pyarrow.py (outdated, resolved)

python/ray/data/_internal/arrow_ops/transform_pyarrow.py (outdated, resolved)

python/ray/data/_internal/arrow_ops/transform_pyarrow.py (outdated, resolved)

python/ray/data/_internal/arrow_ops/transform_pyarrow.py (outdated, resolved)

python/ray/data/_internal/arrow_ops/transform_pyarrow.py (resolved)

python/ray/data/_internal/plan.py (outdated, resolved)

python/ray/data/_internal/plan.py (resolved)

python/ray/data/_internal/arrow_ops/transform_pyarrow.py (outdated, resolved)

scottjlee added 5 commits

December 19, 2022 22:20


          comments

cadbf24

Signed-off-by: Scott Lee <[email protected]>


          Merge branch 'master' into ragged-arrow-schema

46e11fb

Signed-off-by: Scott Lee <[email protected]>


          additional unit tests

fb93cad

Signed-off-by: Scott Lee <[email protected]>


          comments, format, clean up

669fcd3

Signed-off-by: Scott Lee <[email protected]>


          Merge branch 'master' into ragged-arrow-schema

2f0bbe2

Signed-off-by: Scott Lee <[email protected]>

scottjlee requested a review from clarkzinzow

December 20, 2022 19:45

clarkzinzow approved these changes

Contributor

clarkzinzow left a comment

LGTM!

python/ray/data/_internal/arrow_ops/transform_pyarrow.py (outdated, resolved)

python/ray/data/_internal/arrow_ops/transform_pyarrow.py

+                  else:
+                      schemas_to_unify = schemas
+                  # Let Arrow unify the schema of non-tensor extension type columns.
+                  return pyarrow.unify_schemas(schemas_to_unify)

Contributor

clarkzinzow Dec 20, 2022

This can be a future PR (I can do it as part of the type promotion PR), but we might want to try-except this pyarrow.unify_schemas() call, since this is the point at which we're validating that all of the schemas from different blocks are compatible. Propagating any exception raised from pyarrow.unify_schemas() seems fine for now, and in the future we can look at wrapping any raised exceptions with our own error indicating that the blocks have incompatible schemas and giving the user a path to rectifying this (e.g. manually specifying a schema at read time, so all blocks are consistent).
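The wrapping suggested in this comment might look roughly like the sketch below. It is an illustration under stated assumptions, not the follow-up PR: `IncompatibleBlockSchemasError` and `unify_or_explain` are hypothetical names, and the `unify` parameter exists only so the wrapper can be exercised without pyarrow installed. By default it calls `pyarrow.unify_schemas`, imported lazily inside the function.

```python
from typing import Any, Callable, Optional, Sequence


class IncompatibleBlockSchemasError(ValueError):
    """Hypothetical error type for this sketch; not part of Ray."""


def unify_or_explain(
    schemas: Sequence[Any],
    unify: Optional[Callable[[list], Any]] = None,
) -> Any:
    """Unify block schemas, re-raising failures with a hint for the user."""
    if unify is None:
        # Deferred import, so importing this module does not require pyarrow.
        import pyarrow

        unify = pyarrow.unify_schemas
    try:
        return unify(list(schemas))
    except (ValueError, TypeError) as e:
        # pyarrow's ArrowInvalid subclasses ValueError and ArrowTypeError
        # subclasses TypeError, so both are caught by this clause.
        raise IncompatibleBlockSchemasError(
            "Blocks have incompatible schemas. Consider specifying a schema "
            "explicitly at read time so that all blocks are consistent. "
            f"Original error: {e}"
        ) from e
```

Chaining with `from e` keeps the original Arrow error visible in the traceback, so the friendlier message adds guidance without hiding the underlying cause.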

scottjlee added 3 commits

December 20, 2022 12:57


          final comments and format

6efd914

Signed-off-by: Scott Lee <[email protected]>


          Merge branch 'master' into ragged-arrow-schema

b5d916c

Signed-off-by: Scott Lee <[email protected]>


          Merge branch 'master' into ragged-arrow-schema

94a4b8d

Signed-off-by: Scott Lee <[email protected]>

c21 reviewed

python/ray/data/_internal/arrow_ops/transform_pyarrow.py (outdated, resolved)

scottjlee added 2 commits

January 3, 2023 12:41


          defer pyarrow import to unify_schemas func

55704e6

Signed-off-by: Scott Lee <[email protected]>


          Merge branch 'master' into ragged-arrow-schema

8b60918

Signed-off-by: Scott Lee <[email protected]>

clarkzinzow merged commit 17b2235 into ray-project:master

AmeerHajAli pushed a commit that referenced this pull request


          [Datasets] Fix schema unification for Datasets with ragged Arrow arra…

16db1ca

…ys (#31076)

When creating Datasets with ragged arrays, the resulting Dataset incorrectly uses ArrowTensorArray instead of ArrowVariableShapedTensorArray as the underlying schema type. This PR refactors existing logic for schema unification into a separate function, which is now called during Arrow table concatenation and schema fetching to correct type promotion involving ragged arrays.

Signed-off-by: Scott Lee <[email protected]>

tamohannes pushed a commit to ju2ez/ray that referenced this pull request


          [Datasets] Fix schema unification for Datasets with ragged Arrow arra…

c697b4e

…ys (ray-project#31076)

When creating Datasets with ragged arrays, the resulting Dataset incorrectly uses ArrowTensorArray instead of ArrowVariableShapedTensorArray as the underlying schema type. This PR refactors existing logic for schema unification into a separate function, which is now called during Arrow table concatenation and schema fetching to correct type promotion involving ragged arrays.

Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: tmynn <[email protected]>

Reviewers

clarkzinzow approved these changes

c21 left review comments

Awaiting requested review from ericl

Awaiting requested review from scv119

Awaiting requested review from jjyao

Awaiting requested review from jianoaix

Labels

None yet