Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Fix ndarray representation of single-element ragged tensor slices. #30514

Conversation

clarkzinzow
Copy link
Contributor

@clarkzinzow clarkzinzow commented Nov 20, 2022

Single-element ragged tensor slices (e.g. a[5:6]) currently have the wrong NumPy representation; namely, although they are object-dtyped, they have a multi-dimensional shape, and its single tensor element isn't well-typed (in other words, it doesn't use the pointer-to-subdarrays representation). This is due to np.array([subndarray], dtype=object) trying to create a more consolidated representation than np.array([subndarray1, subndarray2], dtype=object). This causes single-element batches of ragged tensor slices failing to eventually be put back into the tensor extension representation.

This PR fixes this by doing a very explicit ragged tensor construction via the create-and-fill method: we allocate an empty, object-dtyped 1D array and fill it with the tensor elements. This prevents NumPy from trying to optimize the ragged tensor representation.

Example

Have subndarray = np.array([[1, 2], [3, 4]], dtype=np.int64) and arr be an (N, 2, 2) ndarray.

N > 1 (Single-Element) Case

With arr = np.array([subndarray, subndarray], dtype=object]), you get

ndarray(
    [
        ndarray([[1, 2], [3, 4]], dtype=np.int64),
        ndarray([[1, 2], [3, 4]], dtype=np.int64),
    ],
    dtype=object
)

I.e., you get a 1D array of pointers to the subndarrays, so the subndarrays are well-typed, e.g. arr[0].dtype == np.int64.

N == 1 (Multi-Element) Case

But the single-element case, arr = np.array([subndarray], dtype=object), tries to consolidate the inner subndarray into the outer ndarray representation:

ndarray([[[1, 2], [3, 4]]], dtype=object)

The biggest impact of this is arr[0].dtype == object, which breaks a big tensor extension array assumption: that individual tensor elements are well-typed.

Related issue number

Closes #30513, closes #30406, closes #30059

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM w/ comments

@@ -178,7 +178,14 @@ def test_arrow_variable_shaped_tensor_array_slice():
slice(0, 3),
]
for slice_ in slices:
for o, e in zip(ata[slice_], arr[slice_]):
ata_slice = ata[slice_]
ata_slice_np = ata_slice.to_numpy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we converting to NumPy here so we can get the dtype?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to using NumPy as the "source of truth" for slicing semantics for multi-dimensional arrays (which we typically do), we're primarily interested in whether the slicing results in the expected semantics for the NumPy views of the data rather than the Arrow data, e.g. the fact that slicing preserves the Arrow-level typing doesn't need to be tested, that's guaranteed by Arrow's extension type slicing, but we need to make sure that the NumPy-level dtype is what we'd expect.

ata_slice_np = ata_slice.to_numpy()
arr_slice = arr[slice_]
# Check for equivalent dtypes and shapes.
assert ata_slice_np.dtype == arr_slice.dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't ArrowTensorArrow expose dtype and shape methods like TensorArray?

Copy link
Contributor Author

@clarkzinzow clarkzinzow Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the data type and element shape are exposed on the underlying tensor array type, i.e. ata.type.storage.value_type for the Arrow type, ata.type.to_pandas_dtype().numpy_dtype for the NumPy dtype, and ata.type.shape for the element shape.

We haven't exposed them directly on ArrowTensorArray under the assumption that the user will typically be working with the Pandas-side TensorArray extension type (if they're interacting with tensor extension types at all; they're hidden by default), so the Arrow side isn't optimized for end-user interaction.

Agreed that making the Arrow-side extension type more user-friendly is worth doing though, we should open a ticket for that.

@clarkzinzow clarkzinzow force-pushed the datasets/fix/ragged-tensor-single-element-slice branch from 44c5b7e to 7410a0c Compare November 22, 2022 14:18
@clarkzinzow clarkzinzow merged commit 36aebcb into ray-project:master Nov 27, 2022
stephanie-wang added a commit that referenced this pull request Nov 28, 2022
stephanie-wang added a commit that referenced this pull request Nov 28, 2022
…d tensor slices. (#30514)" (#30709)

This reverts commit 36aebcb.

Reverts #30514

This is causing linux://python/ray/data:tests/test_transform_pyarrow to fail.
clarkzinzow added a commit to clarkzinzow/ray that referenced this pull request Nov 29, 2022
clarkzinzow added a commit that referenced this pull request Nov 29, 2022
…ged tensor slices. (#30514)" (#30721)

This unreverts #30514 by reverting commit 579770a.

A test was merged into master while the original PR was open, which then broke when the original PR was merged. This wasn't caught in pre-merge checks since the PR was merged without having rebased onto latest master.
clarkzinzow added a commit to clarkzinzow/ray that referenced this pull request Nov 29, 2022
…ged tensor slices. (ray-project#30514)" (ray-project#30721)

This unreverts ray-project#30514 by reverting commit 579770a.

A test was merged into master while the original PR was open, which then broke when the original PR was merged. This wasn't caught in pre-merge checks since the PR was merged without having rebased onto latest master.
clarkzinzow added a commit that referenced this pull request Nov 30, 2022
…ged tensor slices. (#30514)" (#30721) (#30752)

This unreverts #30514 by reverting commit 579770a.

A test was merged into master while the original PR was open, which then broke when the original PR was merged. This wasn't caught in pre-merge checks since the PR was merged without having rebased onto latest master.
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
… slices. (ray-project#30514)

Single-element ragged tensor slices (e.g. a[5:6]) currently have the wrong NumPy representation; namely, although they are object-dtyped, they have a multi-dimensional shape, and its single tensor element isn't well-typed (in other words, it doesn't use the pointer-to-subdarrays representation). This is due to np.array([subndarray], dtype=object) trying to create a more consolidated representation than np.array([subndarray1, subndarray2], dtype=object). This causes single-element batches of ragged tensor slices failing to eventually be put back into the tensor extension representation.

This PR fixes this by doing a very explicit ragged tensor construction via the create-and-fill method: we allocate an empty, object-dtyped 1D array and fill it with the tensor elements. This prevents NumPy from trying to optimize the ragged tensor representation.

Signed-off-by: Weichen Xu <[email protected]>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…d tensor slices. (ray-project#30514)" (ray-project#30709)

This reverts commit 36aebcb.

Reverts ray-project#30514

This is causing linux://python/ray/data:tests/test_transform_pyarrow to fail.

Signed-off-by: Weichen Xu <[email protected]>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…ged tensor slices. (ray-project#30514)" (ray-project#30721)

This unreverts ray-project#30514 by reverting commit 579770a.

A test was merged into master while the original PR was open, which then broke when the original PR was merged. This wasn't caught in pre-merge checks since the PR was merged without having rebased onto latest master.

Signed-off-by: Weichen Xu <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
… slices. (ray-project#30514)

Single-element ragged tensor slices (e.g. a[5:6]) currently have the wrong NumPy representation; namely, although they are object-dtyped, they have a multi-dimensional shape, and its single tensor element isn't well-typed (in other words, it doesn't use the pointer-to-subdarrays representation). This is due to np.array([subndarray], dtype=object) trying to create a more consolidated representation than np.array([subndarray1, subndarray2], dtype=object). This causes single-element batches of ragged tensor slices failing to eventually be put back into the tensor extension representation.

This PR fixes this by doing a very explicit ragged tensor construction via the create-and-fill method: we allocate an empty, object-dtyped 1D array and fill it with the tensor elements. This prevents NumPy from trying to optimize the ragged tensor representation.

Signed-off-by: tmynn <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
…d tensor slices. (ray-project#30514)" (ray-project#30709)

This reverts commit 36aebcb.

Reverts ray-project#30514

This is causing linux://python/ray/data:tests/test_transform_pyarrow to fail.

Signed-off-by: tmynn <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
…ged tensor slices. (ray-project#30514)" (ray-project#30721)

This unreverts ray-project#30514 by reverting commit 579770a.

A test was merged into master while the original PR was open, which then broke when the original PR was merged. This wasn't caught in pre-merge checks since the PR was merged without having rebased onto latest master.

Signed-off-by: tmynn <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants