Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Automatically cast tensor columns when building Pandas blocks. #26924

Merged
merged 3 commits into from
Jul 25, 2022

Conversation

matthewdeng
Copy link
Contributor

@matthewdeng matthewdeng commented Jul 23, 2022

Why are these changes needed?

This PR just applies the changes from the following PRs:

CI tests are all green: https://buildkite.com/ray-project/ray-builders-pr/builds/39699

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…cks. (ray-project#26684)

This PR tries to automatically cast tensor columns to our TensorArray extension type when building Pandas blocks, logging a warning and falling back to the opaque object-typed column if the cast fails. This should allow users to remain mostly tensor extension agnostic.

TensorArray now eagerly validates the underlying tensor data, raising an error if e.g. the underlying ndarrays have heterogeneous shapes; previously, TensorArray wouldn't validate this on construction and would instead let failures happen downstream. This means that our internal TensorArray use needs to follow a try-except pattern, falling back to a plain NumPy object column.
@matthewdeng matthewdeng added the do-not-merge Do not merge this PR! label Jul 23, 2022
@matthewdeng matthewdeng changed the title [wip] cherry-pick tensordtype changes [Datasets] Automatically cast tensor columns when building Pandas blocks. Jul 23, 2022
@matthewdeng matthewdeng added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed do-not-merge Do not merge this PR! labels Jul 23, 2022
@matthewdeng matthewdeng marked this pull request as ready for review July 23, 2022 14:49
@scv119
Copy link
Contributor

scv119 commented Jul 23, 2022

you probably need @richardliaw approval.

@clarkzinzow
Copy link
Contributor

@matthewdeng is this good to merge?

except TypeError as e:
# Fall back to existing NumPy array.
if ray.util.log_once("datasets_tensor_array_cast_warning"):
logger.warning(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make this an error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing this in a follow-up!

Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
…cks. (ray-project#26924)

This PR just applies the changes from the following PRs:

[Datasets] Automatically cast tensor columns when building Pandas blocks. ray-project#26684
reverted by Revert "[Datasets] Automatically cast tensor columns when building Pandas blocks." ray-project#26921
[AIR - Datasets] Fix TensorDtype construction from string and fix example. ray-project#26904
This fixes the test failures introduced in the originally reverted PRs.

Signed-off-by: Rohan138 <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…cks. (ray-project#26924)

This PR just applies the changes from the following PRs:

[Datasets] Automatically cast tensor columns when building Pandas blocks. ray-project#26684
reverted by Revert "[Datasets] Automatically cast tensor columns when building Pandas blocks." ray-project#26921
[AIR - Datasets] Fix TensorDtype construction from string and fix example. ray-project#26904
This fixes the test failures introduced in the originally reverted PRs.

Signed-off-by: Stefan van der Kleij <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants