Unify ArrowTensorType tables and Tensor blocks #18867

ericl · 2021-09-24T02:02:35Z

Why are these changes needed?

This PR removes the special-case Tensor block format from Datasets, in favor of representing these as single-column Arrow tables.

Before:

ray.data.range_tensor(10000)
# -> Dataset(num_blocks=200, num_rows=10000,
#            schema=<Tensor: shape=(None, 3, 5), dtype=int64>)

After:

ray.data.range_tensor(10000)
# -> Dataset(num_blocks=200, num_rows=10000,
#            schema={value: <ArrowTensorType: shape=(3, 5), dtype=int64>})

The NumPy read/write operations have been changed to support Arrow tables with tensor columns.

This both removes a lot of special-case Tensor handling and cleans up the user story for tensor support. In the future, we can add more methods to make working with Tensors simpler on top of this common path.

scv119 · 2021-09-24T08:17:46Z

Looks like a great simplification! I'd defer to @clarkzinzow to review and accept.

doc/source/data/dataset-tensor-support.rst

clarkzinzow

This looks great! I'm glad that it turned out to be a pretty simple port.

clarkzinzow · 2021-09-27T19:03:40Z

python/ray/data/impl/arrow_block.py

+                    column, self._table.column_names))
+        array = self._table[column]
+        if array.num_chunks > 1:
+            # TODO(ekl) combine fails since we can't concat ArrowTensorType?


Hmm, this should be easy to add; we can look at supporting it.

ericl · 2021-09-27

Ok, do you have any pointers? I can give it a try.

ericl added 3 commits September 23, 2021 19:02

wip

5ec634f

update

e280e03

wip

0a9200e

ericl changed the title ~~[WIP] Unify ArrowTensorType tables and Tensor blocks~~ Unify ArrowTensorType tables and Tensor blocks Sep 24, 2021

ericl assigned scv119 and clarkzinzow Sep 24, 2021

ericl added 4 commits September 23, 2021 21:22

fix

7eb0f2d

update

5fb2e81

fix

befd773

fix

b06e895

scv119 reviewed Sep 24, 2021

doc/source/data/dataset-tensor-support.rst (outdated, resolved)

ericl force-pushed the remove-t branch from bb97b2b to b37c500 September 25, 2021 01:02

update

62cfbc8

ericl force-pushed the remove-t branch from b37c500 to 62cfbc8 September 25, 2021 01:02

ericl added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Sep 26, 2021

clarkzinzow approved these changes Sep 27, 2021

ericl merged commit caf34a4 into ray-project:master Sep 27, 2021
