-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Datasets] Fix ndarray representation of single-element ragged tensor slices. #30514
[Datasets] Fix ndarray representation of single-element ragged tensor slices. #30514
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM w/ comments
@@ -178,7 +178,14 @@ def test_arrow_variable_shaped_tensor_array_slice(): | |||
slice(0, 3), | |||
] | |||
for slice_ in slices: | |||
for o, e in zip(ata[slice_], arr[slice_]): | |||
ata_slice = ata[slice_] | |||
ata_slice_np = ata_slice.to_numpy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we converting to NumPy here so we can get the dtype
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to using NumPy as the "source of truth" for slicing semantics for multi-dimensional arrays (which we typically do), we're primarily interested in whether the slicing results in the expected semantics for the NumPy views of the data rather than the Arrow data, e.g. the fact that slicing preserves the Arrow-level typing doesn't need to be tested, that's guaranteed by Arrow's extension type slicing, but we need to make sure that the NumPy-level dtype is what we'd expect.
ata_slice_np = ata_slice.to_numpy() | ||
arr_slice = arr[slice_] | ||
# Check for equivalent dtypes and shapes. | ||
assert ata_slice_np.dtype == arr_slice.dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why doesn't ArrowTensorArrow
expose dtype
and shape
methods like TensorArray
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both the data type and element shape are exposed on the underlying tensor array type, i.e. ata.type.storage.value_type
for the Arrow type, ata.type.to_pandas_dtype().numpy_dtype
for the NumPy dtype, and ata.type.shape
for the element shape.
We haven't exposed them directly on ArrowTensorArray
under the assumption that the user will typically be working with the Pandas-side TensorArray
extension type (if they're interacting with tensor extension types at all; they're hidden by default), so the Arrow side isn't optimized for end-user interaction.
Agreed that making the Arrow-side extension type more user-friendly is worth doing though, we should open a ticket for that.
44c5b7e
to
7410a0c
Compare
…nt ragged tensor slices. (ray-project#30514)" (ray-project#30709)" This reverts commit 579770a.
…ged tensor slices. (#30514)" (#30721) This unreverts #30514 by reverting commit 579770a. A test was merged into master while the original PR was open, which then broke when the original PR was merged. This wasn't caught in pre-merge checks since the PR was merged without having rebased onto latest master.
…ged tensor slices. (ray-project#30514)" (ray-project#30721) This unreverts ray-project#30514 by reverting commit 579770a. A test was merged into master while the original PR was open, which then broke when the original PR was merged. This wasn't caught in pre-merge checks since the PR was merged without having rebased onto latest master.
…ged tensor slices. (#30514)" (#30721) (#30752) This unreverts #30514 by reverting commit 579770a. A test was merged into master while the original PR was open, which then broke when the original PR was merged. This wasn't caught in pre-merge checks since the PR was merged without having rebased onto latest master.
… slices. (ray-project#30514) Single-element ragged tensor slices (e.g. a[5:6]) currently have the wrong NumPy representation; namely, although they are object-dtyped, they have a multi-dimensional shape, and its single tensor element isn't well-typed (in other words, it doesn't use the pointer-to-subdarrays representation). This is due to np.array([subndarray], dtype=object) trying to create a more consolidated representation than np.array([subndarray1, subndarray2], dtype=object). This causes single-element batches of ragged tensor slices failing to eventually be put back into the tensor extension representation. This PR fixes this by doing a very explicit ragged tensor construction via the create-and-fill method: we allocate an empty, object-dtyped 1D array and fill it with the tensor elements. This prevents NumPy from trying to optimize the ragged tensor representation. Signed-off-by: Weichen Xu <[email protected]>
…d tensor slices. (ray-project#30514)" (ray-project#30709) This reverts commit 36aebcb. Reverts ray-project#30514 This is causing linux://python/ray/data:tests/test_transform_pyarrow to fail. Signed-off-by: Weichen Xu <[email protected]>
…ged tensor slices. (ray-project#30514)" (ray-project#30721) This unreverts ray-project#30514 by reverting commit 579770a. A test was merged into master while the original PR was open, which then broke when the original PR was merged. This wasn't caught in pre-merge checks since the PR was merged without having rebased onto latest master. Signed-off-by: Weichen Xu <[email protected]>
… slices. (ray-project#30514) Single-element ragged tensor slices (e.g. a[5:6]) currently have the wrong NumPy representation; namely, although they are object-dtyped, they have a multi-dimensional shape, and its single tensor element isn't well-typed (in other words, it doesn't use the pointer-to-subdarrays representation). This is due to np.array([subndarray], dtype=object) trying to create a more consolidated representation than np.array([subndarray1, subndarray2], dtype=object). This causes single-element batches of ragged tensor slices failing to eventually be put back into the tensor extension representation. This PR fixes this by doing a very explicit ragged tensor construction via the create-and-fill method: we allocate an empty, object-dtyped 1D array and fill it with the tensor elements. This prevents NumPy from trying to optimize the ragged tensor representation. Signed-off-by: tmynn <[email protected]>
…d tensor slices. (ray-project#30514)" (ray-project#30709) This reverts commit 36aebcb. Reverts ray-project#30514 This is causing linux://python/ray/data:tests/test_transform_pyarrow to fail. Signed-off-by: tmynn <[email protected]>
…ged tensor slices. (ray-project#30514)" (ray-project#30721) This unreverts ray-project#30514 by reverting commit 579770a. A test was merged into master while the original PR was open, which then broke when the original PR was merged. This wasn't caught in pre-merge checks since the PR was merged without having rebased onto latest master. Signed-off-by: tmynn <[email protected]>
Single-element ragged tensor slices (e.g.
a[5:6]
) currently have the wrong NumPy representation; namely, although they are object-dtyped, they have a multi-dimensional shape, and its single tensor element isn't well-typed (in other words, it doesn't use the pointer-to-subdarrays representation). This is due tonp.array([subndarray], dtype=object)
trying to create a more consolidated representation thannp.array([subndarray1, subndarray2], dtype=object)
. This causes single-element batches of ragged tensor slices failing to eventually be put back into the tensor extension representation.This PR fixes this by doing a very explicit ragged tensor construction via the create-and-fill method: we allocate an empty, object-dtyped 1D array and fill it with the tensor elements. This prevents NumPy from trying to optimize the ragged tensor representation.
Example
Have
subndarray = np.array([[1, 2], [3, 4]], dtype=np.int64)
andarr
be an(N, 2, 2)
ndarray.N > 1 (Single-Element) Case
With
arr = np.array([subndarray, subndarray], dtype=object])
, you getI.e., you get a 1D array of pointers to the subndarrays, so the subndarrays are well-typed, e.g.
arr[0].dtype == np.int64
.N == 1 (Multi-Element) Case
But the single-element case,
arr = np.array([subndarray], dtype=object)
, tries to consolidate the inner subndarray into the outer ndarray representation:The biggest impact of this is
arr[0].dtype == object
, which breaks a big tensor extension array assumption: that individual tensor elements are well-typed.Related issue number
Closes #30513, closes #30406, closes #30059
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.