
[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets. #24812

Merged

Conversation

clarkzinzow
Contributor

@clarkzinzow clarkzinzow commented May 15, 2022

This PR makes several improvements to the Datasets tensor story. See the issues for each item for more details.

This should improve the UX/efficiency of the following:

  • Working with pure-tensor datasets in general.
  • Mapping tensor UDFs over pure-tensor datasets, providing a better foundation for tensor-native preprocessing for end-users and AIR.

Example of the tensor-native UDF UX

Below is a motivating example for the automatic NumPy conversion for the default "native" format, showing the single-tensor-column case.

Before

import pandas as pd
import ray

ds = ray.data.range_tensor(16, shape=(4, 4), parallelism=4)

def mapper(df: pd.DataFrame) -> pd.DataFrame:
    # I have to know about this `"value"` column and the
    # internal tensor representation as this single-column table.
    df["value"] *= 2
    return df

# I have to convert it to Pandas even though I'm working with tensor data,
# which is not the API that I want to work with AND incurs expensive
# copies during the Arrow --> Pandas conversion.
ds = ds.map_batches(mapper, batch_size=2, batch_format="pandas")

for batch in ds.iter_batches(batch_size=2, batch_format="pandas"):
    # I again have to convert the batches to Pandas and then access this
    # dummy column.
    actual_batch = batch["value"]

After

import numpy as np
import ray

ds = ray.data.range_tensor(16, shape=(4, 4), parallelism=4)

# UDF now takes and returns an ndarray, which will be zero-copy access
# to the underlying Arrow table, providing far better UX and less
# memory/perf overhead. The returned ndarray will be repacked into an
# Arrow table (again, zero-copy).
ds = ds.map_batches(lambda arr: 2*arr, batch_size=2)

for batch in ds.iter_batches(batch_size=2):
    # The batch is the underlying ndarray batch, as I would expect; I don't
    # have to know or care about the dummy column or the single-column
    # table representation.
    assert isinstance(batch, np.ndarray)
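The zero-copy claim above rests on NumPy's view semantics: an ndarray can wrap an existing buffer without copying it, so reading a batch need not duplicate the underlying data. A minimal NumPy-only sketch of that idea (independent of Ray/Arrow, which are only conceptually assumed here):

```python
import numpy as np

# Simulate a flat column buffer like the one backing an Arrow tensor column.
buf = np.arange(16, dtype=np.int64)

# Reshape into a (4, 4) batch view; no data is copied.
batch = buf.reshape(4, 4)
assert batch.base is buf  # `batch` is a view over `buf`'s memory

# An elementwise UDF like `lambda arr: 2 * arr` allocates a new output array,
# but accessing the input batch itself incurred no copy.
out = 2 * batch
assert buf[1] == 1  # original buffer untouched
```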

Changed Semantics

This PR changes the semantics of how the default/"native" format is interpreted for pure-tensor datasets, i.e., datasets consisting of a single tensor column in an Arrow/Pandas table with a dummy internal column name, such as those created by ray.data.range_tensor() or ray.data.from_numpy(). These changed semantics are described below for per-row UDFs, batch UDFs, and per-row and batch iterators.

The following are the changed semantics for how UDF return values are interpreted:

  • A per-row UDF returns a NumPy ndarray:
    • Before: Creates a simple block (list) of ndarrays.
    • After: Creates a single-tensor-column Arrow table.
  • A batch UDF returns a NumPy ndarray:
    • Before: Raises an error.
    • After: Creates a single-tensor-column Arrow table.
  • A batch UDF returns a list of NumPy ndarrays:
    • Before: Creates a simple block (list) of ndarrays.
    • After: Creates a single-tensor-column Arrow table.

The following are the changed semantics for the format in which the UDF's data argument is supplied:

  • A per-row UDF is called on a pure-tensor dataset:
    • Before: UDF receives a single-column PandasRow containing the tensor column element.
    • After: UDF receives the tensor column element as a NumPy ndarray.
  • A batch UDF is called on a pure-tensor dataset:
    • Before: UDF receives a single-column Pandas table containing the tensor column.
    • After: UDF receives the tensor column as a NumPy ndarray.

The following are the changed semantics for the format of consumed batches:

  • A row iterator is consumed on a pure-tensor dataset:
    • Before: A single-column tabular row (either PandasRow or ArrowRow) is returned, containing the tensor column element.
    • After: The tensor column element is returned as a NumPy ndarray.
  • A batch iterator is consumed on a pure-tensor dataset:
    • Before: A single-column Pandas table is returned containing the tensor column.
    • After: The tensor column is returned as a NumPy ndarray.
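As a rough mental model of the new return-value inference (a hypothetical helper; the real logic lives in Datasets' block builders, and a plain dict stands in for an Arrow table here), an ndarray return is wrapped into a single-tensor-column table keyed by the internal dummy column:

```python
import numpy as np

VALUE_COL_NAME = "__value__"  # internal dummy column name used by Datasets

def infer_block(udf_return):
    """Sketch of the new semantics: ndarray (or list-of-ndarray) returns
    become a single-tensor-column table, modeled here as a plain dict."""
    if isinstance(udf_return, np.ndarray):
        return {VALUE_COL_NAME: udf_return}
    if isinstance(udf_return, list) and all(
        isinstance(x, np.ndarray) for x in udf_return
    ):
        # Stack per-row ndarrays into one tensor column.
        return {VALUE_COL_NAME: np.stack(udf_return)}
    return udf_return  # tabular returns pass through unchanged

block = infer_block(np.ones((2, 4, 4)))
assert set(block) == {VALUE_COL_NAME}
```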

Related issue number

Closes #24208, #24207

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@clarkzinzow clarkzinzow changed the title [Datasets] Better tensor support. [Datasets] Improve tensor support in Datasets. May 15, 2022
@clarkzinzow clarkzinzow force-pushed the datasets/feat/better-tensor-support branch from d948851 to 03d4a34 Compare May 15, 2022 03:54
Contributor

@ericl ericl left a comment


Main comment is to rejigger this slightly to be more generalizable to handling simple block values in the same way in the future.

python/ray/data/dataset.py (resolved)
python/ray/data/dataset.py (outdated, resolved)
python/ray/data/dataset.py (outdated, resolved)
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 15, 2022
@clarkzinzow clarkzinzow force-pushed the datasets/feat/better-tensor-support branch from 927bb98 to f0e6267 Compare May 17, 2022 00:30
@clarkzinzow clarkzinzow force-pushed the datasets/feat/better-tensor-support branch from f0e6267 to 774ea2a Compare May 17, 2022 00:56
@clarkzinzow
Contributor Author

@ericl PR is updated with:

  • TENSOR_COL_NAME = "__RAY_TC__" --> VALUE_COL_NAME = "__value__"
  • "numpy" batch format is removed (for follow-up PR)

I'll update the PR description with this downscoping once I open up the stacked follow-up PR (I'm going to want to copy some of this PR description over).

@clarkzinzow clarkzinzow force-pushed the datasets/feat/better-tensor-support branch from ae9be8d to 497c812 Compare May 17, 2022 01:53
python/ray/data/dataset.py (outdated, resolved)
Contributor

@ericl ericl left a comment


Looks good at high level to me.

@clarkzinzow clarkzinzow changed the title [Datasets] Improve tensor support in Datasets. [Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets. May 17, 2022
@clarkzinzow
Contributor Author

@ericl Feedback addressed! I've also opened a follow-up PR adding the "numpy" format, stacked on this one. #24870

@clarkzinzow clarkzinzow requested a review from ericl May 17, 2022 02:34
@clarkzinzow clarkzinzow force-pushed the datasets/feat/better-tensor-support branch from e44251f to 8e0b520 Compare May 17, 2022 09:35
@clarkzinzow clarkzinzow force-pushed the datasets/feat/better-tensor-support branch from 9a1a3c5 to ccdccc3 Compare May 17, 2022 19:07
@clarkzinzow
Contributor Author

@ericl @jianoaix Ready for review, PTAL

@clarkzinzow clarkzinzow force-pushed the datasets/feat/better-tensor-support branch 2 times, most recently from fce2d3b to c045f36 Compare May 18, 2022 00:53
@clarkzinzow
Contributor Author

@ericl @jianoaix I think those last few commits fixed the remaining bugs after the delegating block builder refactor; we should know when the Datasets CI job finishes. My laptop is dead and I won't be able to charge it for a while, so can y'all get this across the line?

Contributor

@jianoaix jianoaix left a comment


LG from my side.


# Save to storage.
ds.write_numpy("/tmp/tensor_out", column="value")
Contributor


Use literalinclude?

Contributor Author


I matched the existing inline code block style to keep the PR focused and the diff small; I'm hoping to port all of the code examples in this feature guide as part of a Working with Tensors feature guide overhaul.

@clarkzinzow
Contributor Author

clarkzinzow commented May 20, 2022

The Datasets failure is unrelated: it's a flaky test that was skipped, but that skip was recently reverted in 2be45fe. The other failures also appear to be unrelated, so I'm going to merge this.

I'll open a separate PR to skip that test.

@clarkzinzow clarkzinzow merged commit 841f7c8 into ray-project:master May 20, 2022
@clarkzinzow
Contributor Author

PR to skip test here: #25009

clarkzinzow added a commit to clarkzinzow/ray that referenced this pull request May 20, 2022
… UDFs and infer tensor blocks for pure-tensor datasets. (ray-project#24812)

This PR makes several improvements to the Datasets' tensor story. See the issues for each item for more details.

- Automatically infer tensor blocks (single-column tables representing a single tensor) when returning NumPy ndarrays from map_batches(), map(), and flat_map().
- Automatically infer tensor columns when building tabular blocks in general.
- Fixes shuffling and sorting for tensor columns.

This should improve the UX/efficiency of the following:

- Working with pure-tensor datasets in general.
- Mapping tensor UDFs over pure-tensor datasets, providing a better foundation for tensor-native preprocessing for end-users and AIR.
krfricke added a commit that referenced this pull request May 20, 2022
…views to UDFs and infer tensor blocks for pure-tensor datasets. (#24812)"

This reverts commit 841f7c8.
krfricke added a commit that referenced this pull request May 20, 2022
clarkzinzow added a commit to clarkzinzow/ray that referenced this pull request May 20, 2022
… tensor views to UDFs and infer tensor blocks for pure-tensor datasets. (ray-project#24812)" (ray-project#25017)"

This reverts commit fbfb134.
avnishn pushed a commit that referenced this pull request May 23, 2022
…A. (#25010)

* [Datasets] Add `from_huggingface` for Hugging Face datasets integration (#24464)

Adds a from_huggingface method to Datasets, which allows the conversion of a Hugging Face Dataset to a Ray Dataset. As a Hugging Face Dataset is backed by an Arrow table, the conversion is trivial.

* Test the CSV read with column types specified (#24398)

Make sure users can read csv with columns types specified.
Users may want to do this because sometimes PyArrow's type inference doesn't work as intended, in which case users can step in and work around the type inference.

* [Datasets] [Docs] Add a warning about from_huggingface (#24608)

Adds a warning to docs about the intended use of from_huggingface.

* [data] Expose `drop_last` in `to_tf` (#24666)

* [data] More informative exceptions in block impl (#24665)

* Add a classic yet small-sized ML dataset for demo/documentation/testing (#24592)

To facilitate easy demo/documentation/testing with realistic, small-sized yet ML-familiar data. Have it as a source file with code will make it self-contained, i.e. after user "pip install" Ray, they are all set to run it.

IRIS is a great fit: super classic ML dataset, simple schema, only 150 rows.

* [Datasets] Add more example data. (#24795)

This PR adds more example data for ongoing feature guide work. In addition to adding the new datasets, this also puts all example data under examples/data in order to separate it from the example code.

* [Datasets] Add example protocol for reading canned in-package example data. (#24800)

Providing easy-access datasets is table stakes for a good Getting Started UX, but even with good in-package data, it can be difficult to make these paths accessible to the user. This PR adds an "example://" protocol that will resolve passed paths directly to our canned in-package example data.

* [minor] Use np.searchsorted to speed up random access dataset (#24825)
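For context on the `np.searchsorted` optimization mentioned above (the actual random-access code paths are not shown in this thread), binary search over sorted cumulative block boundaries can replace a linear scan when locating which block holds a given row index. A hedged sketch with illustrative block sizes:

```python
import numpy as np

# Hypothetical sorted array of cumulative row counts per block:
# block 0 holds rows [0, 100), block 1 holds [100, 250), block 2 holds [250, 400).
block_ends = np.array([100, 250, 400])

def block_for_row(row_idx: int) -> int:
    # searchsorted with side="right" returns the first block whose end
    # boundary exceeds row_idx, in O(log n) instead of an O(n) scan.
    return int(np.searchsorted(block_ends, row_idx, side="right"))

assert block_for_row(99) == 0
assert block_for_row(100) == 1
```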

* [Datasets] Change `range_arrow()` API to `range_table()` (#24704)

This PR changes the ray.data.range_arrow() to ray.data.range_table(), making the Arrow representation an implementation detail.

* [Datasets] Support tensor columns in `to_tf` and `to_torch`. (#24752)

This PR adds support for tensor columns in the to_tf() and to_torch() APIs.

For Torch, this involves an explicit extension array check and (zero-copy) conversion of the tensor column to a NumPy array before converting the column to a Torch tensor.

For TensorFlow, this involves bypassing df.values when converting tensor feature columns to NumPy arrays, instead manually creating a single NumPy array from the column Series.

In both cases, I think that the UX around heterogeneous feature columns and squeezing the column dimension could be improved, but I'm saving that for a future PR.
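The "manually creating a single NumPy array from the column Series" step described above can be sketched without pandas: given a column whose elements are per-row ndarrays, `np.stack` builds one contiguous feature array (the names and shapes here are illustrative, not Datasets' actual internals):

```python
import numpy as np

# A tensor column as a sequence of per-row ndarrays (what a pandas
# Series of ndarray objects would hold element-wise).
column = [np.full((2, 2), i, dtype=np.float32) for i in range(3)]

# df.values would yield a 1-D object array here; stacking instead
# produces a proper (num_rows, 2, 2) float32 array suitable for TF/Torch.
features = np.stack(column)
assert features.shape == (3, 2, 2)
```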

* Implement random_sample() (#24492)

* Map progress bar title; pretty repr for rows. (#24672)

* [Datasets] [CI] fix CI of dataset test (#24883)

CI test is broken by f61caa3. This PR fixes it.

* [Datasets] Add explicit resource allocation option via a top-level scheduling strategy (#24438)

Instead of letting Datasets implicitly use cluster resources in the margins of explicit allocations of other libraries, such as Tune, Datasets should provide an option for explicitly allocating resources for a Datasets workload for users that want to box Datasets in. This PR adds such an explicit resource allocation option, via exposing a top-level scheduling strategy on the DatasetContext with which a placement group can be given.

* [Datasets] Add example of using `map_batches` to filter (#24202)

The documentation says 

> Consider using .map_batches() for better performance (you can implement filter by dropping records).

but there aren't any examples of how to do so.

* [doc] Add docs for push-based shuffle in Datasets (#24486)

Adds recommendations, example, and brief benchmark results for push-based shuffle in Datasets.

* [Doc][Data] fix big-data-ingestion broken links (#24631)

The links were broken. Fixed it.

* [docs] Fix import error in Ray Data "getting started" (#24424)

We did `import pandas as pd` but here we are using it as `pandas`

* [Datasets] Overhaul of "Creating Datasets" feature guide. (#24831)

This PR is a general overhaul of the "Creating Datasets" feature guide, providing complete coverage of all (public) dataset creation APIs and highlighting features and quirks of the individual APIs, data modalities, storage backends, etc. In order to keep the page from getting too long and keeping it easy to navigate, tabbed views are used heavily.

* [Datasets] Add basic data ecosystem overview, user guide links, other data processing options card. (#23346)

* Revamp the Getting Started page for Dataset (#24860)

This is part of the Dataset GA doc fix effort to update/improve the documentation.
This PR revamps the Getting Started page.

What are the changes:
- Focus on basic/core features that are bread-and-butter for users, leave the advanced features out
- Focus on high level introduction, leave the detailed spec out (e.g. what are possible batch_types for map_batches() API)
- Use more realistic (yet still simple) data example that's familiar to people (IRIS dataset in this case)
- Use the same data example throughout to make it context-switch free
- Use runnable code rather than faked
- Reference to the code from doc, instead of inlining them in the doc

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Eric Liang <[email protected]>

* [Datasets] Miscellaneous GA docs P0s. (#24891)

This PR knocks off a few miscellaneous GA docs P0s given in our docs tracker. Namely:

- Documents Datasets resource allocation model.
- De-emphasizes global/windowed shuffling.
- Documents lazy execution mode, and expands our execution model docs in general.

* [docs] After careful consideration, choose the lesser of two evils and set white-space: pre-wrap #24873

* [Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets. (#24812)

This PR makes several improvements to the Datasets' tensor story. See the issues for each item for more details.

- Automatically infer tensor blocks (single-column tables representing a single tensor) when returning NumPy ndarrays from map_batches(), map(), and flat_map().
- Automatically infer tensor columns when building tabular blocks in general.
- Fixes shuffling and sorting for tensor columns.

This should improve the UX/efficiency of the following:

- Working with pure-tensor datasets in general.
- Mapping tensor UDFs over pure-tensor datasets, providing a better foundation for tensor-native preprocessing for end-users and AIR.

* [Datasets] Overhaul "Accessing Datasets" feature guide. (#24963)

This PR overhauls the "Accessing Datasets", adding proper coverage of each data consuming methods, including the ML framework exchange APIs (to_torch() and to_tf()).

* [Datasets] Add FAQ to Datasets docs. (#24932)

This PR adds a FAQ to Datasets docs.

Docs preview: https://ray--24932.org.readthedocs.build/en/24932/

- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Co-authored-by: Eric Liang <[email protected]>

* [Datasets] Add basic e2e Datasets example on NYC taxi dataset (#24874)

This PR adds a dedicated docs page for examples, and adds a basic e2e tabular data processing example on the NYC taxi dataset.

The goal of this example is to demonstrate basic data reading, inspection, transformations, and shuffling, along with ingestion into dummy model trainers and doing dummy batch inference, for tabular (Parquet) data.

* Revamp the Datasets API docstrings (#24949)

* Revamp the Saving Datasets user guide (#24987)

* Fix AIR references in Datasets FAQ.

* [Datasets] Skip flaky pipelining memory release test (#25009)

This pipelining memory release test is flaky; it was skipped in this Polars PR, which was then reverted.

* Note that explicit resource allocation is experimental, fix typos (#25038)

* fix the notebook test failure

* no-op indent fix

* fix notebooks test #2

* Revamp the Transforming Datasets user guide (#25033)

* Fix range_arrow(), which is replaced by range_table() (#25036)

* indent

* allow empty

* Proofread the some datasets docs (#25068)

Co-authored-by: Ubuntu <[email protected]>

* [Data] Add partitioning classes to Data API reference (#24203)

Co-authored-by: Antoni Baum <[email protected]>
Co-authored-by: Jian Xiao <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: Robert <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
Co-authored-by: Chen Shen <[email protected]>
Co-authored-by: Zhe Zhang <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
clarkzinzow added a commit that referenced this pull request Jun 8, 2022
…ovide tensor views to UDFs and infer tensor blocks for pure-tensor datasets. (#25031)"  (#25531)

Unreverts #24812, skipping the memory releasing tests that are already flaky. We have a separate issue tracking the unskipping of these memory releasing tests, once we find a more reliable way to test them.

* Revert "Revert "Revert "Revert "[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets."" (#25031)" (#25057)"

This reverts commit fb2933a.

* Skip shuffle memory release test.
sumanthratna pushed a commit to sumanthratna/ray that referenced this pull request Jun 8, 2022
…ovide tensor views to UDFs and infer tensor blocks for pure-tensor datasets. (ray-project#25031)"  (ray-project#25531)

Unreverts ray-project#24812, skipping the memory releasing tests that are already flaky. We have a separate issue tracking the unskipping of these memory releasing tests, once we find a more reliable way to test them.

* Revert "Revert "Revert "Revert "[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets."" (ray-project#25031)" (ray-project#25057)"

This reverts commit fb2933a.

* Skip shuffle memory release test.
chongxiaoc added a commit to chongxiaoc/ray that referenced this pull request Jul 8, 2022
…x random access dataset.

Cherrypick ray-project#24812
Need the commit ray-project@f3a2ac4
to fix Arrow bugs.

Exclude changes of _build_tensor_row() and random_access_dataset.py.

Co-Authored-By: Clark Zinzow <[email protected]>
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Datasets] Automatically infer tensor blocks when returning NumPy ndarrays from .map_batches()
5 participants