Unify ArrowTensorType tables and Tensor blocks #18867

Merged on Sep 27, 2021 (8 commits)

@ericl ericl commented Sep 24, 2021

Why are these changes needed?

This PR removes the special-case Tensor block format from Datasets. Tensor data is now represented as a single-column Arrow table instead.

Before:

ray.data.range_tensor(10000)
# -> Dataset(num_blocks=200, num_rows=10000,
#            schema=<Tensor: shape=(None, 3, 5), dtype=int64>)

After:

ray.data.range_tensor(10000)
# -> Dataset(num_blocks=200, num_rows=10000,
#            schema={value: <ArrowTensorType: shape=(3, 5), dtype=int64>})

The NumPy read and write operations have been changed to support Arrow tables with tensor columns.

This both removes a lot of special-case Tensor handling and cleans up the user story for tensor support. In the future, we can add more methods to make working with Tensors simpler on top of this common path.
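
To make the before/after schemas above concrete, here is a minimal conceptual sketch in plain Python. It does not use Ray or pyarrow; the helper names (`make_tensor_block`, `block_to_single_column_table`, `element_shape`) are hypothetical, and filling row `i` with the value `i` is an assumption made only for illustration. It shows a block of shape `(None, 3, 5)` becoming a table with one `value` column whose cells each have shape `(3, 5)`.

```python
from typing import Dict, List, Tuple

# One row of the dataset: a nested list with shape (3, 5).
Tensor = List[List[int]]


def make_tensor_block(num_rows: int, shape: Tuple[int, int] = (3, 5)) -> List[Tensor]:
    """Old layout: a single tensor block of shape (num_rows, 3, 5).

    Assumption for illustration only: row i is filled with the value i.
    """
    rows, cols = shape
    return [[[i] * cols for _ in range(rows)] for i in range(num_rows)]


def block_to_single_column_table(
    block: List[Tensor], column: str = "value"
) -> Dict[str, List[Tensor]]:
    """New layout: a table with one column, one tensor per cell."""
    return {column: list(block)}


def element_shape(table: Dict[str, List[Tensor]], column: str = "value") -> Tuple[int, int]:
    """Shape of a single cell, analogous to the shape shown in the new schema."""
    first = table[column][0]
    return (len(first), len(first[0]))


table = block_to_single_column_table(make_tensor_block(4))
```

The leading `None` dimension of the old schema becomes the row count of the table, so the per-element shape no longer includes it.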

@ericl ericl changed the title [WIP] Unify ArrowTensorType tables and Tensor blocks Unify ArrowTensorType tables and Tensor blocks Sep 24, 2021
scv119 commented Sep 24, 2021

Looks like a great simplification! I'd defer to @clarkzinzow to review and accept this.

@ericl ericl added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Sep 26, 2021

@clarkzinzow clarkzinzow left a comment

This looks great! I'm glad that it turned out to be a pretty simple port.

        column, self._table.column_names))
array = self._table[column]
if array.num_chunks > 1:
    # TODO(ekl) combine fails since we can't concat ArrowTensorType?
Contributor

Hmm should be easy to add, we can look at supporting this.

Contributor Author

Ok, do you have any pointers? I can give it a try
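
One possible direction for the chunk-combining TODO discussed in this thread, sketched in plain Python rather than with pyarrow: if every chunk of a tensor column keeps its values in a flat storage buffer alongside a shared per-element shape, combining chunks could amount to concatenating the storage and re-wrapping it with that shape. `TensorChunk` and `combine_chunks` are hypothetical names, and the storage layout is an assumption, not a description of how `ArrowTensorType` is implemented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TensorChunk:
    """Hypothetical stand-in for one chunk of a tensor column."""

    shape: Tuple[int, ...]  # per-element shape, e.g. (3, 5)
    storage: List[int]      # flattened values of every element in the chunk

    def __len__(self) -> int:
        # Number of tensor elements = total values / values per element.
        size = 1
        for dim in self.shape:
            size *= dim
        return len(self.storage) // size


def combine_chunks(chunks: List[TensorChunk]) -> TensorChunk:
    """Concatenate the flat storage of all chunks into a single chunk."""
    if not chunks:
        raise ValueError("need at least one chunk")
    shape = chunks[0].shape
    if any(chunk.shape != shape for chunk in chunks):
        raise ValueError("all chunks must share the same element shape")
    storage: List[int] = []
    for chunk in chunks:
        storage.extend(chunk.storage)
    return TensorChunk(shape=shape, storage=storage)


first = TensorChunk(shape=(2, 2), storage=list(range(8)))       # 2 elements
second = TensorChunk(shape=(2, 2), storage=list(range(8, 12)))  # 1 element
combined = combine_chunks([first, second])
```

The shape check matters because concatenating storage only makes sense when every chunk uses the same element shape; mixed shapes would need a different representation.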

@ericl ericl merged commit caf34a4 into ray-project:master Sep 27, 2021