
[Datasets] Fix boundary sampling concatenation. #20784

Merged

Conversation

clarkzinzow
Contributor

In sample_boundaries, naive concatenation with np.concatenate() doesn't work when the single-column sample blocks have varying lengths (e.g., when the original dataset had non-uniform blocks). This PR fixes this by delegating concatenation and NumPy array conversion to the block builder and block accessor, respectively.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2021
@jjyao
Collaborator

jjyao commented Nov 30, 2021

Could you elaborate on the root cause? numpy cannot concatenate pyarrow tables with different rows?

@clarkzinzow
Contributor Author

clarkzinzow commented Nov 30, 2021

@jjyao NumPy can't reliably concatenate pyarrow tables with different numbers of rows, yes. It fails with a mismatched-dimension error.

@clarkzinzow clarkzinzow removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2021
@jjyao
Collaborator

jjyao commented Nov 30, 2021

@jjyao NumPy can't reliably concatenate pyarrow tables with different numbers of rows, yes. It fails with a mismatched-dimension error.

This is weird. NumPy is able to concatenate 2D arrays with different numbers of rows. If NumPy interprets the pyarrow table correctly, it should be able to concatenate them.
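For reference, the premise here can be checked directly: NumPy does concatenate 2D arrays with different numbers of rows along axis 0, since only the non-concatenation axis has to match. A minimal sketch:

```python
import numpy as np

# 2D arrays with different numbers of rows but the same number of columns.
a = np.ones((3, 2))
b = np.ones((5, 2))

# Concatenation along axis 0 (the default) only requires the
# column counts to match, so this succeeds.
out = np.concatenate([a, b])
print(out.shape)  # (8, 2)
```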

Contributor

@ericl ericl left a comment


LGTM

python/ray/data/tests/test_dataset.py — review comment (outdated, resolved)
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2021
@clarkzinzow
Contributor Author

clarkzinzow commented Nov 30, 2021

This is weird. Numpy is able to concatenate 2d arrays with different number of rows. If numpy interprets pyarrow table correctly, it should be able to concatenate them.

My point here is that a naive np.concatenate() (i.e. no axis= argument) won't properly concatenate the NumPy interpretation (np.asarray()) of single-column Arrow tables. Each of the single-column Arrow table blocks is converted into a (1, block.num_rows) ndarray, so you're trying to concatenate (1, block1.num_rows), ..., (1, blockn.num_rows) ndarrays along the row axis; you can easily confirm that this doesn't work with a naive NumPy concatenate:

In [1]: import numpy as np
In [2]: arrs = [np.ones((1, 3)), np.ones((1, 5))]
In [3]: np.concatenate(arrs)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-71a15e1ac38f> in <module>
----> 1 np.concatenate(arrs)
<__array_function__ internals> in concatenate(*args, **kwargs)
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 5

This obviously works with a non-naive np.concatenate(arrs, axis=1), but that then breaks the simple block representation, whose blocks are 1D arrays that should be concatenated along axis=0. This is why I'm instead using the delegating block builder to build a single block, which performs a block-aware concatenation that adapts to the different representations under the hood:

  • pa.concat_tables() for Arrow blocks
  • List concatenation for simple blocks

The remainder is then about doing a proper NumPy conversion.

@ericl
Contributor

ericl commented Nov 30, 2021 via email

@clarkzinzow
Contributor Author

But aren't you dropping the dimension from the original tensor produced by
range_tensor? The shape is [Nrow, 1] not [Nrow].

Hmm, isn't that just a representation/implementation detail? The real tensor that the user actually cares about is a 1D tensor of shape (nrows,), and (nrows, 1) just adds a redundant extra dimension due to some quirk in our tensor block representation, right?

Also, for my understanding, why do we need to represent tensors with that extra dimension?

@clarkzinzow clarkzinzow force-pushed the datasets/hotfix/aggregation-skewed-blocks branch from ab0e26e to 80e2dba on November 30, 2021 18:40
@clarkzinzow clarkzinzow removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2021
@ericl
Contributor

ericl commented Nov 30, 2021

Hmm isn't that just a representation/implementation detail? The real tensor that the user actually cares about is a 1D tensor, of shape (nrows,), and (nrows, 1) is just adding an extra redundant dimension due to some quirks in our tensor block representation, right?

Tensor shape is part of the tensor; we can't just drop dimensions, as that would effectively corrupt user data.

Also, for my understanding, why do we need to represent tensors with that extra dimension?

It's because we have to faithfully represent tensors of any dimensionality, regardless of whether the dimension seems "relevant". Btw we could also change the default of range_tensor() to have shape=[], which would do what you describe by default:

>>> ray.data.range_tensor(2, shape=[]).show()
{'value': array(0)}
{'value': array(1)}
>>> ray.data.range_tensor(2).show()
{'value': array([0])}
{'value': array([1])}
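To spell out the distinction in plain NumPy terms (an illustrative sketch, not part of the PR): a dataset of n scalar rows corresponds to per-row shape () and overall shape (n,), while per-row shape (1,) yields overall shape (n, 1), and the two are genuinely different tensors.

```python
import numpy as np

n = 2
scalars = np.arange(n)               # per-row shape ():  overall shape (n,)
column = np.arange(n).reshape(n, 1)  # per-row shape (1,): overall shape (n, 1)

print(scalars.shape)  # (2,)
print(column.shape)   # (2, 1)

# Squeezing the trailing axis recovers the scalar version, but doing
# that silently would alter the user's data, per the point above.
print(np.array_equal(column.squeeze(axis=1), scalars))  # True
```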

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2021
@clarkzinzow
Contributor Author

@ericl I just realized that the ray.data.range_tensor() API has a default per-row shape of (1,), not (); that's what I was missing. I had thought that the extra dimension was a quirk of our tensor block representation, not part of the actual generated tensor.

Anyways, the original tensor semantics are being preserved so this discussion isn't blocking this PR.

@clarkzinzow clarkzinzow added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Nov 30, 2021
@clarkzinzow
Contributor Author

The Datasets tests pass so I think that this is ready to be merged once the remaining tests are done running.

@ericl ericl merged commit adbcc4f into ray-project:master Nov 30, 2021