
[Datasets] Fix boundary sampling concatenation. #20784

Merged

Conversation

clarkzinzow
Contributor

In sample_boundaries, naive concatenation with np.concatenate() doesn't work when the single-column sample blocks have varying lengths (e.g., when the original dataset had non-uniform blocks). This PR fixes this by delegating concatenation and NumPy array conversion to the block builder and block accessor, respectively.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2021
@jjyao
Collaborator

jjyao commented Nov 30, 2021

Could you elaborate on the root cause? numpy cannot concatenate pyarrow tables with different rows?

@clarkzinzow
Contributor Author

clarkzinzow commented Nov 30, 2021

@jjyao NumPy can't reliably concatenate pyarrow tables with different numbers of rows, yes. It fails with a mismatched-dimension error.

@clarkzinzow clarkzinzow removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2021
@jjyao
Collaborator

jjyao commented Nov 30, 2021

@jjyao NumPy can't reliably concatenate pyarrow tables with different numbers of rows, yes. It fails with a mismatched-dimension error.

This is weird. NumPy is able to concatenate 2D arrays with different numbers of rows. If NumPy interprets the pyarrow table correctly, it should be able to concatenate them.
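For reference, the premise here can be checked directly: NumPy does concatenate 2D arrays with different numbers of rows along axis 0, since only the non-concatenation axis has to match. A minimal sketch:

```python
import numpy as np

# 2D arrays with different numbers of rows but the same number of columns.
a = np.ones((3, 2))
b = np.ones((5, 2))

# Concatenation along axis 0 (the default) only requires the
# column counts to match, so this succeeds.
out = np.concatenate([a, b])
print(out.shape)  # (8, 2)
```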

Contributor

@ericl ericl left a comment


LGTM

python/ray/data/tests/test_dataset.py — review comment (outdated, resolved)
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2021
@clarkzinzow
Contributor Author

clarkzinzow commented Nov 30, 2021

This is weird. Numpy is able to concatenate 2d arrays with different number of rows. If numpy interprets pyarrow table correctly, it should be able to concatenate them.

My point here is that a naive np.concatenate() (i.e. no axis= argument) won't properly concatenate the NumPy interpretation (np.asarray()) of single-column Arrow tables. Each of the single-column Arrow table blocks is converted into a (1, block.num_rows) ndarray, so you're trying to concatenate (1, block1.num_rows), ..., (1, blockn.num_rows) ndarrays along the row axis; you can easily confirm that this doesn't work with a naive NumPy concatenate:

In [1]: import numpy as np
In [2]: arrs = [np.ones((1, 3)), np.ones((1, 5))]
In [3]: np.concatenate(arrs)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-71a15e1ac38f> in <module>
----> 1 np.concatenate(arrs)
<__array_function__ internals> in concatenate(*args, **kwargs)
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 5

This obviously works with a non-naive np.concatenate(arrs, axis=1), but that then breaks the simple block representation, whose blocks are 1D arrays that should be concatenated along axis=0. This is why I'm instead using the delegating block builder to build a single block, which performs a block-aware concatenation that adapts to the different representations under the hood:

  • pa.concat_tables() for Arrow blocks
  • List concatenation for simple blocks

The remainder is then about doing a proper NumPy conversion.

@ericl
Contributor

ericl commented Nov 30, 2021 via email

@clarkzinzow
Contributor Author

But aren't you dropping the dimension from the original tensor produced by
range_tensor? The shape is [Nrow, 1] not [Nrow].

Hmm, isn't that just a representation/implementation detail? The real tensor that the user actually cares about is a 1D tensor of shape (nrows,), and (nrows, 1) just adds a redundant extra dimension due to some quirk in our tensor block representation, right?

Also, for my understanding, why do we need to represent tensors with that extra dimension?

@clarkzinzow clarkzinzow force-pushed the datasets/hotfix/aggregation-skewed-blocks branch from ab0e26e to 80e2dba on November 30, 2021 18:40
@clarkzinzow clarkzinzow removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2021
@ericl
Contributor

ericl commented Nov 30, 2021

Hmm isn't that just a representation/implementation detail? The real tensor that the user actually cares about is a 1D tensor, of shape (nrows,), and (nrows, 1) is just adding an extra redundant dimension due to some quirks in our tensor block representation, right?

Tensor shape is part of the tensor; we can't just drop dimensions, as that would effectively corrupt user data.

Also, for my understanding, why do we need to represent tensors with that extra dimension?

It's because we have to faithfully represent tensors of any dimensionality, regardless of whether the dimension seems "relevant". Btw we could also change the default of range_tensor() to have shape=[], which would do what you describe by default:

>>> ray.data.range_tensor(2, shape=[]).show()
{'value': array(0)}
{'value': array(1)}
>>> ray.data.range_tensor(2).show()
{'value': array([0])}
{'value': array([1])}
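To spell out the distinction in plain NumPy terms (an illustrative sketch, not part of the PR): a dataset of n scalar rows corresponds to per-row shape () and overall shape (n,), while per-row shape (1,) yields overall shape (n, 1), and the two are genuinely different tensors.

```python
import numpy as np

n = 2
scalars = np.arange(n)               # per-row shape ():  overall shape (n,)
column = np.arange(n).reshape(n, 1)  # per-row shape (1,): overall shape (n, 1)

print(scalars.shape)  # (2,)
print(column.shape)   # (2, 1)

# Squeezing the trailing axis recovers the scalar version, but doing
# that silently would alter the user's data, per the point above.
print(np.array_equal(column.squeeze(axis=1), scalars))  # True
```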

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2021
@clarkzinzow
Contributor Author

@ericl I just realized that the ray.data.range_tensor() API has a default per-row shape of (1,), not (); that's what I was missing. I had thought that the extra dimension was a quirk of our tensor block representation, not part of the actual generated tensor.

Anyways, the original tensor semantics are being preserved so this discussion isn't blocking this PR.

@clarkzinzow clarkzinzow added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Nov 30, 2021
@clarkzinzow
Contributor Author

The Datasets tests pass so I think that this is ready to be merged once the remaining tests are done running.

@ericl ericl merged commit adbcc4f into ray-project:master Nov 30, 2021