[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets. #24812
Changes from all commits: fadf917, 94ecc45, 50dfa53, daa3d15, 4bf27a0, f3a2ac4, e36bd18, 9a819f0, cb3e761, 72f3dc1
@@ -15,22 +15,57 @@ Automatic conversion between the Pandas and Arrow extension types/arrays keeps t

 Single-column tensor datasets
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-The most basic case is when a dataset only has a single column, which is of tensor type. This kind of dataset can be created with ``.range_tensor()``, and can be read from and written to ``.npy`` files. Here are some examples:
+The most basic case is when a dataset only has a single column, which is of tensor
+type. This kind of dataset can be:
+
+* created with :func:`range_tensor() <ray.data.range_tensor>`
+  or :func:`from_numpy() <ray.data.from_numpy>`,
+* transformed with NumPy UDFs via
+  :meth:`ds.map_batches() <ray.data.Dataset.map_batches>`,
+* consumed with :meth:`ds.iter_rows() <ray.data.Dataset.iter_rows>` and
+  :meth:`ds.iter_batches() <ray.data.Dataset.iter_batches>`, and
+* can be read from and written to ``.npy`` files.
+
+Here is an end-to-end example:

 .. code-block:: python

-    # Create a Dataset of tensor-typed values.
-    ds = ray.data.range_tensor(10000, shape=(3, 5))
-    # -> Dataset(num_blocks=200, num_rows=10000,
-    #            schema={value: <ArrowTensorType: shape=(3, 5), dtype=int64>})
+    # Create a synthetic pure-tensor Dataset.
+    ds = ray.data.range_tensor(10, shape=(3, 5))
+    # -> Dataset(num_blocks=10, num_rows=10,
+    #            schema={__value__: <ArrowTensorType: shape=(3, 5), dtype=int64>})

-    # Save to storage.
-    ds.write_numpy("/tmp/tensor_out", column="value")
+    # Create a pure-tensor Dataset from an existing NumPy ndarray.
+    arr = np.arange(10 * 3 * 5).reshape((10, 3, 5))
+    ds = ray.data.from_numpy(arr)
+    # -> Dataset(num_blocks=1, num_rows=10,
+    #            schema={__value__: <ArrowTensorType: shape=(3, 5), dtype=int64>})

-    # Read from storage.
+    # Transform the tensors. Datasets will automatically unpack the single-column Arrow
+    # table into a NumPy ndarray, provide that ndarray to your UDF, and then repack it
+    # into a single-column Arrow table; this will be a zero-copy conversion in both
+    # cases.
+    ds = ds.map_batches(lambda arr: arr / arr.max())
+    # -> Dataset(num_blocks=1, num_rows=10,
+    #            schema={__value__: <ArrowTensorType: shape=(3, 5), dtype=double>})
+
+    # Consume the tensor. This will yield the underlying (3, 5) ndarrays.
+    for arr in ds.iter_rows():
+        assert isinstance(arr, np.ndarray)
+        assert arr.shape == (3, 5)
+
+    # Consume the tensor in batches.
+    for arr in ds.iter_batches(batch_size=2):
+        assert isinstance(arr, np.ndarray)
+        assert arr.shape == (2, 3, 5)
+
+    # Save to storage. This will write out the blocks of the tensor column as NPY files.
+    ds.write_numpy("/tmp/tensor_out")
+
+    # Read back from storage.
     ray.data.read_numpy("/tmp/tensor_out")
-    # -> Dataset(num_blocks=200, num_rows=?,
-    #            schema={value: <ArrowTensorType: shape=(3, 5), dtype=int64>})
+    # -> Dataset(num_blocks=1, num_rows=?,
+    #            schema={__value__: <ArrowTensorType: shape=(3, 5), dtype=double>})

Review comment (on the zero-copy note in the transform step):

> Do you mean this ndarray is a view on the Arrow table, rather than a new memory instance? If not, I don't think this is actually zero-copy.

Reply:

> It's zero-copy on the underlying array buffers; the conversion only involves switching to a different view on top of those array buffers (our single-tensor-column Arrow table vs. a NumPy ndarray).

> SG!
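The zero-copy behavior discussed above can be illustrated with plain NumPy: converting between a flat column buffer and per-row tensors is just a reshape, i.e. a different view over the same memory. This is a minimal stand-in sketch, not Ray's actual internals:

```python
import numpy as np

# Stand-in for the flat value buffer backing a fixed-shape tensor column.
flat = np.arange(10 * 3 * 5, dtype=np.int64)

# "Unpacking" into per-row (3, 5) tensors is a reshape, which returns a
# view over the same buffer rather than a copy.
tensors = flat.reshape(10, 3, 5)
assert np.shares_memory(flat, tensors)

# Writing through the view is visible in the original buffer.
tensors[0, 0, 0] = -1
assert flat[0] == -1
```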
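The schema change from ``int64`` to ``double`` after the ``map_batches`` normalization step follows from NumPy's type promotion: true division of integer arrays yields ``float64``. A quick standalone check of the same UDF, runnable without Ray:

```python
import numpy as np

# A batch shaped like the example dataset: 10 rows of (3, 5) tensors.
batch = np.arange(10 * 3 * 5, dtype=np.int64).reshape(10, 3, 5)

# The same normalization UDF used in the example above.
normalized = batch / batch.max()

assert normalized.dtype == np.float64  # int64 / int64 promotes to float64
assert normalized.max() == 1.0
assert normalized.shape == (10, 3, 5)
```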
Reading existing serialized tensor columns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
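Since ``write_numpy`` serializes each block of the tensor column as an ``.npy`` file, the on-disk format is the standard NumPy one. For intuition, one block round-trips like this; ``np.save``/``np.load`` here are a stand-in for what Ray writes and reads, and the file name is illustrative, not Ray's actual naming scheme:

```python
import os
import tempfile

import numpy as np

# One block of the tensor column: 10 rows of (3, 5) tensors.
block = np.arange(10 * 3 * 5, dtype=np.int64).reshape(10, 3, 5)

# Write it as a single .npy file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "block_0.npy")
np.save(path, block)

# Reading it back recovers the exact shape, dtype, and contents.
loaded = np.load(path)
assert loaded.shape == (10, 3, 5)
assert loaded.dtype == np.int64
assert np.array_equal(block, loaded)
```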
Review comment:

> Use literalinclude?

Reply:

> I matched the existing inline code-block style to keep the PR focused and the diff small, hoping to port all of the code examples in this feature guide in a "Working with Tensors" feature guide overhaul.
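For reference, the ``literalinclude`` suggestion would replace the inline block with a Sphinx directive along these lines; the path and the start/end markers here are illustrative assumptions, not the actual repo layout:

```rst
.. literalinclude:: ./doc_code/tensor_example.py
   :language: python
   :start-after: __tensor_example_begin__
   :end-before: __tensor_example_end__
```

This keeps the documented snippet in a real Python file that can be run under CI, so the docs cannot silently drift from working code.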