[Datasets] Fix boundary sampling concatenation. #20784
Conversation
Could you elaborate on the root cause? NumPy cannot concatenate pyarrow tables with different numbers of rows? |
@jjyao Yes, NumPy can't reliably concatenate pyarrow tables with different numbers of rows; it fails with a mismatched-dimension error. |
This is weird. NumPy is able to concatenate 2D arrays with different numbers of rows, so if NumPy interpreted the pyarrow tables correctly, it should be able to concatenate them.
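For reference, the NumPy half of this claim is easy to check; a minimal sketch (plain ndarrays only, no pyarrow involved):

```python
import numpy as np

# 2D arrays with different row counts concatenate fine along axis 0,
# as long as the trailing dimensions match.
a = np.arange(6).reshape(3, 2)   # shape (3, 2)
b = np.arange(4).reshape(2, 2)   # shape (2, 2)
print(np.concatenate([a, b]).shape)  # (5, 2)
```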
LGTM
My point here is that a naive `np.concatenate()` over the per-block conversions doesn't work here; this obviously works with a non-naive concatenation (e.g., one delegated to the block builder). The remainder is then about doing a proper NumPy conversion. |
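As a rough illustration of that split, here is a sketch using pyarrow directly; the PR itself delegates these steps to Datasets' block builder and block accessor, so the calls below are stand-ins, not the actual fix:

```python
import numpy as np
import pyarrow as pa

# Two single-column "sample blocks" with different numbers of rows.
t1 = pa.table({"value": [1, 2, 3]})
t2 = pa.table({"value": [4, 5]})

# Non-naive concatenation: combine at the table level first...
combined = pa.concat_tables([t1, t2])
# ...then do one proper NumPy conversion of the column.
arr = combined.column("value").to_numpy()
print(arr)  # [1 2 3 4 5]
```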
But aren't you dropping the dimension from the original tensor produced by `range_tensor`? The shape is [Nrow, 1], not [Nrow].
…On Mon, Nov 29, 2021, Clark Zinzow commented on this pull request, in python/ray/data/tests/test_dataset.py (#20784 (comment)):
```diff
  # Tensor Dataset
  ds = ray.data.range_tensor(10, parallelism=2)
  arr = np.concatenate(ray.get(ds.to_numpy_refs(column="value")))
- np.testing.assert_equal(arr, np.expand_dims(np.arange(0, 10), 1))
+ np.testing.assert_equal(arr, np.arange(0, 10))
```
I think that's because the original tensor had shape (1,), not (), so this seems correct. I'd think that the column selection with `ds.to_numpy_refs(column="value")` would take the (1, num_rows) single-column table and return a (num_rows,) 1D ndarray. |
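To make the shape question concrete, a small sketch assuming (as stated above) that each row holds a tensor of shape (1,):

```python
import numpy as np

# Ten rows, each holding a tensor of shape (1,). Stacking them yields a
# (10, 1) array, i.e. np.expand_dims(np.arange(10), 1), not a flat (10,).
per_row = [np.array([i]) for i in range(10)]  # each element: shape (1,)
stacked = np.stack(per_row)                   # shape (10, 1)
np.testing.assert_equal(stacked, np.expand_dims(np.arange(10), 1))
```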
Hmm, isn't that just a representation/implementation detail? The real tensor that the user actually cares about is a 1D tensor of shape [Nrow]. Also, for my understanding, why do we need to represent tensors with that extra dimension? |
Force-pushed from ab0e26e to 80e2dba (compare)
Tensor shape is part of the tensor; we can't just drop dimensions, as that would be akin to corrupting user data. |
It's because we have to faithfully represent tensors of any dimensionality, regardless of whether a dimension seems "relevant". Btw, we could also change the default of `range_tensor()` to have shape [Nrow] instead of [Nrow, 1].
|
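A tiny illustration of that point (shapes only; nothing Ray-specific):

```python
import numpy as np

# A column of one-element tensors and a column of scalars hold the same
# values but are different user data; a faithful round trip must keep
# the (Nrow, 1) shape rather than collapsing it to (Nrow,).
col_of_vectors = np.arange(3).reshape(3, 1)  # three tensors of shape (1,)
col_of_scalars = np.arange(3)                # three scalar rows
assert col_of_vectors.shape == (3, 1)
assert col_of_scalars.shape == (3,)
```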
@ericl In any case, the original tensor semantics are being preserved, so this discussion isn't blocking this PR. |
The Datasets tests pass, so I think this is ready to merge once the remaining tests finish running. |
In `sample_boundaries`, naive concatenation with `np.concatenate()` doesn't work when the single-column sample blocks have varying lengths (e.g., when the original dataset had non-uniform blocks). This PR fixes this by delegating concatenation and NumPy array conversion to the block builder and block accessor, respectively.

Checks

- I've run `scripts/format.sh` to lint the changes in this PR.
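For context, here is a self-contained sketch of the approach the description outlines. This is not Ray's actual implementation: the real code goes through Datasets' block builder and block accessor, while the function below uses pyarrow directly, and its name and boundary-picking details are illustrative assumptions:

```python
import numpy as np
import pyarrow as pa

def sample_boundaries_sketch(sample_tables, key, num_partitions):
    # "Block builder" step: one table-level concatenation copes with
    # sample blocks of varying lengths.
    combined = pa.concat_tables(sample_tables)
    # "Block accessor" step: a single NumPy conversion of the key column.
    samples = np.sort(combined.column(key).to_numpy())
    # Pick evenly spaced boundary values from the sorted samples.
    idx = np.linspace(0, len(samples) - 1, num_partitions + 1)[1:-1]
    return samples[idx.astype(int)]

# Sample blocks of varying lengths, as produced by non-uniform datasets.
tables = [pa.table({"value": list(range(n))}) for n in (3, 7, 5)]
print(sample_boundaries_sketch(tables, "value", 4))  # three boundary values
```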