Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIR - Datasets] [Hotfix] Tensor extension column concatenation fixes. #29479

Conversation

clarkzinzow
Copy link
Contributor

@clarkzinzow clarkzinzow commented Oct 19, 2022

Fixes the concatenation of tensor columns (and extension types in general). Previously, concatenating tensor columns may produce a variable-shaped tensor but would be represented with a broken homogeneous-shaped tensor. This PR ensures that the correct tensor extension type is produced post-concatenation.

Issue Details

The core issue is that we weren’t accounting for homogeneous-shaped tensor column concatenation resulting in a heterogeneous-shaped tensor column. In master, this currently fails silently until accessing the data fails somewhere downstream of the concatenation.

E.g., if you have 5 images of shape (32, 32) in one block, and 5 images of shape (64, 64) in another block, the image column in each individual block is homogeneous-shaped. But if you concatenate those blocks, you have 10 images, half with shape (32, 32) and half with shape (64, 64), which needs our heterogeneous-shaped (variable-shaped) tensor column representation.

Issue

Closes #29489

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@clarkzinzow clarkzinzow changed the title [AIR - Datasets] [Hotfix] Extension type concatenation fixes. [AIR - Datasets] [Hotfix] Tensor extension column concatenation fixes. Oct 20, 2022
Copy link
Contributor

@jianoaix jianoaix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this issue, just minor comment

python/ray/data/_internal/arrow_block.py Outdated Show resolved Hide resolved
Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM beside @jianoaix's comments.

@clarkzinzow clarkzinzow force-pushed the datasets/fix/concatenating-tensor-columns branch from cf9eec1 to c382d5c Compare October 21, 2022 21:05
@clarkzinzow
Copy link
Contributor Author

@jianoaix Feedback implemented, PTAL. I opted to add test coverage for the particular concat() function since ArrowBlockBuilder delegates directly to that utility, adding test coverage for the block builders would have doubled the size of this PR and seemed out-of-scope, so I think this is a good compromise on coverage.

@clarkzinzow
Copy link
Contributor Author

I've been pushing for full unit test coverage of the block layer for a while, would definitely still like to do that in a dedicated PR.

@clarkzinzow
Copy link
Contributor Author

Failing tests appear to be unrelated, merging.

@clarkzinzow clarkzinzow merged commit 1c82427 into ray-project:master Oct 21, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
ray-project#29479)

Fixes the concatenation of tensor columns (and extension types in general). Previously, concatenating tensor columns may produce a variable-shaped tensor but would be represented with a broken homogeneous-shaped tensor. This PR ensures that the correct tensor extension type is produced post-concatenation.

Issue Details:

The core issue is that we weren’t accounting for homogeneous-shaped tensor column concatenation resulting in a heterogeneous-shaped tensor column. In master, this currently fails silently until accessing the data fails somewhere downstream of the concatenation.

E.g., if you have 5 images of shape (32, 32) in one block, and 5 images of shape (64, 64) in another block, the image column in each individual block is homogeneous-shaped. But if you concatenate those blocks, you have 10 images, half with shape (32, 32) and half with shape (64, 64), which needs our heterogeneous-shaped (variable-shaped) tensor column representation.

Signed-off-by: Weichen Xu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Datasets] Concatenation of homogeneous-shaped tensor columns can produced variable-shaped tensor columns
4 participants