[Datasets] Fix ndarray representation of single-element ragged tensor slices. #30514

clarkzinzow · 2022-11-20T01:11:05Z

Single-element ragged tensor slices (e.g. a[5:6]) currently have the wrong NumPy representation: although they are object-dtyped, they have a multi-dimensional shape, and their single tensor element isn't well-typed (in other words, it doesn't use the pointer-to-subndarrays representation). This is due to np.array([subndarray], dtype=object) trying to create a more consolidated representation than np.array([subndarray1, subndarray2], dtype=object). As a result, single-element batches of ragged tensor slices eventually fail to be put back into the tensor extension representation.

This PR fixes this by doing a very explicit ragged tensor construction via the create-and-fill method: we allocate an empty, object-dtyped 1D array and fill it with the tensor elements. This prevents NumPy from trying to optimize the ragged tensor representation.
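The create-and-fill approach described above can be sketched roughly as follows. This is a minimal illustration using only NumPy; the helper name `create_ragged_ndarray_sketch` is hypothetical and is not claimed to match the helper added in this PR.

```python
import numpy as np


def create_ragged_ndarray_sketch(values):
    """Hypothetical helper: build a 1D object-dtyped array of subndarrays.

    Allocating the outer array first and assigning elements one by one
    keeps NumPy from folding the subndarrays into the outer array.
    """
    arr = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        # Integer-index assignment stores the subndarray itself as the element.
        arr[i] = value
    return arr


subndarray = np.array([[1, 2], [3, 4]], dtype=np.int64)
ragged = create_ragged_ndarray_sketch([subndarray])

print(ragged.shape)     # (1,)
print(ragged[0].dtype)  # int64
```

Assigning through an integer index is what makes this robust: the single-element input still yields a 1D outer array whose element keeps its own dtype.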

Example

Let subndarray = np.array([[1, 2], [3, 4]], dtype=np.int64) and let arr be an (N, 2, 2) ndarray.

N > 1 (Multi-Element) Case

With arr = np.array([subndarray, subndarray], dtype=object), you get

ndarray(
    [
        ndarray([[1, 2], [3, 4]], dtype=np.int64),
        ndarray([[1, 2], [3, 4]], dtype=np.int64),
    ],
    dtype=object
)

I.e., you get a 1D array of pointers to the subndarrays, so the subndarrays are well-typed, e.g. arr[0].dtype == np.int64.

N == 1 (Single-Element) Case

But the single-element case, arr = np.array([subndarray], dtype=object), tries to consolidate the inner subndarray into the outer ndarray representation:

ndarray([[[1, 2], [3, 4]]], dtype=object)

The biggest impact of this is arr[0].dtype == object, which breaks a big tensor extension array assumption: that individual tensor elements are well-typed.
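The single-element behavior can be reproduced with plain NumPy. The snippet below is only an illustration of the consolidation described above; the variable names are chosen for the example.

```python
import numpy as np

subndarray = np.array([[1, 2], [3, 4]], dtype=np.int64)

# Wrapping a single subndarray in a list and requesting dtype=object does
# not produce a 1D array of pointers; NumPy folds the inner dimensions in.
arr = np.array([subndarray], dtype=object)

print(arr.shape)     # (1, 2, 2)
print(arr[0].dtype)  # object
```

Here `arr[0]` is a (2, 2) object-dtyped view rather than the original int64 subndarray, which is the property that the tensor extension array relies on.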

Related issue number

Closes #30513, closes #30406, closes #30059

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

bveeramani

LGTM w/ comments

python/ray/air/util/tensor_extensions/arrow.py

bveeramani · 2022-11-21T21:11:44Z

python/ray/air/tests/test_tensor_extension.py

@@ -178,7 +178,14 @@ def test_arrow_variable_shaped_tensor_array_slice():
        slice(0, 3),
    ]
    for slice_ in slices:
-        for o, e in zip(ata[slice_], arr[slice_]):
+        ata_slice = ata[slice_]
+        ata_slice_np = ata_slice.to_numpy()


Are we converting to NumPy here so we can get the dtype?

In addition to using NumPy as the "source of truth" for slicing semantics for multi-dimensional arrays (which we typically do), we're primarily interested in whether slicing results in the expected semantics for the NumPy views of the data rather than for the Arrow data. That slicing preserves the Arrow-level typing doesn't need to be tested, since Arrow's extension type slicing guarantees it, but we do need to make sure that the NumPy-level dtype is what we'd expect.
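The kind of NumPy-level check discussed here might look like the sketch below. It uses only NumPy: `make_ragged` and `assert_ragged_slices_match` are hypothetical helpers, and a copy of the NumPy slice stands in for the result of `ata_slice.to_numpy()`, so this is not the test from the PR.

```python
import numpy as np


def make_ragged(values):
    """Hypothetical helper: 1D object array holding differently shaped tensors."""
    out = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        out[i] = value
    return out


def assert_ragged_slices_match(actual, expected):
    """Compare outer dtype and shape, then each tensor element."""
    assert actual.dtype == expected.dtype
    assert actual.shape == expected.shape
    for a, e in zip(actual, expected):
        assert a.dtype == e.dtype
        np.testing.assert_array_equal(a, e)


arr = make_ragged(
    [
        np.arange(4).reshape(2, 2),
        np.arange(6).reshape(3, 2),
        np.arange(2).reshape(1, 2),
    ]
)
for slice_ in [slice(0, 1), slice(1, 2), slice(0, 3)]:
    # Stand-in for the extension array's NumPy view of the same slice.
    converted = arr[slice_].copy()
    assert_ragged_slices_match(converted, arr[slice_])
```

Checking the outer shape as well as the per-element dtype is what catches the single-element case, since a consolidated array has shape (1, 2, 2) instead of (1,).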

bveeramani · 2022-11-21T21:12:44Z

python/ray/air/tests/test_tensor_extension.py

+        ata_slice_np = ata_slice.to_numpy()
+        arr_slice = arr[slice_]
+        # Check for equivalent dtypes and shapes.
+        assert ata_slice_np.dtype == arr_slice.dtype


Why doesn't ArrowTensorArray expose dtype and shape methods like TensorArray?

Both the data type and element shape are exposed on the underlying tensor array type, i.e. ata.type.storage.value_type for the Arrow type, ata.type.to_pandas_dtype().numpy_dtype for the NumPy dtype, and ata.type.shape for the element shape.

We haven't exposed them directly on ArrowTensorArray under the assumption that the user will typically be working with the Pandas-side TensorArray extension type (if they're interacting with tensor extension types at all; they're hidden by default), so the Arrow side isn't optimized for end-user interaction.

Agreed that making the Arrow-side extension type more user-friendly is worth doing, though; we should open a ticket for that.

…d tensor slices. (#30514)" This reverts commit 36aebcb.

…d tensor slices. (#30514)" (#30709) This reverts commit 36aebcb. Reverts #30514 This is causing linux://python/ray/data:tests/test_transform_pyarrow to fail.

…nt ragged tensor slices. (ray-project#30514)" (ray-project#30709)" This reverts commit 579770a.

…ged tensor slices. (#30514)" (#30721) This unreverts #30514 by reverting commit 579770a. A test was merged into master while the original PR was open, which then broke when the original PR was merged. This wasn't caught in pre-merge checks since the PR was merged without having rebased onto latest master.

clarkzinzow assigned c21, jiaodong, amogkam, bveeramani and jianoaix Nov 20, 2022

bveeramani approved these changes Nov 21, 2022

bveeramani mentioned this pull request Nov 22, 2022

[AIR] Add TorchVisionPreprocessor #30578

Merged

9 tasks

clarkzinzow added 2 commits November 22, 2022 14:18

Fix ndarray representation of single-element ragged tensor slices.

95af3b3

Create well-documented helper for strictly creating ragged ndarrays.

7410a0c

clarkzinzow force-pushed the datasets/fix/ragged-tensor-single-element-slice branch from 44c5b7e to 7410a0c Compare November 22, 2022 14:18

clarkzinzow merged commit 36aebcb into ray-project:master Nov 27, 2022

stephanie-wang added a commit that referenced this pull request Nov 28, 2022

Revert "[Datasets] Fix ndarray representation of single-element ragge…

0bc8e36

…d tensor slices. (#30514)" This reverts commit 36aebcb.

stephanie-wang mentioned this pull request Nov 28, 2022

Revert "[Datasets] Fix ndarray representation of single-element ragged tensor slices." #30709

Merged

clarkzinzow added a commit to clarkzinzow/ray that referenced this pull request Nov 29, 2022

Revert "Revert "[Datasets] Fix ndarray representation of single-eleme…

196a789

…nt ragged tensor slices. (ray-project#30514)" (ray-project#30709)" This reverts commit 579770a.

clarkzinzow mentioned this pull request Nov 29, 2022

[Datasets] Unrevert "Fix ndarray representation of single-element ragged tensor slices. (#30514)" #30721

Merged

7 tasks

clarkzinzow mentioned this pull request Nov 29, 2022

[Cherry-pick] [Datasets] Fix ndarray representation of single-element ragged tensor slices. (#30721) #30752

Merged
