
[Datasets] Add .iter_torch_batches() and .iter_tf_batches() APIs. #26689

Conversation

clarkzinzow
Contributor

This PR adds .iter_torch_batches() and .iter_tf_batches() convenience APIs, which take care of ML framework tensor conversion, avoid the narrow-tensor waste of the underlying .iter_batches() call (by using the "numpy" batch format), and unify batch formats around two options: a single tensor for simple/pure-tensor/single-column datasets, and a dictionary of tensors for multi-column datasets.
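As a rough sketch of the batch-format unification described above (the helper name here is illustrative, not Ray's actual internals), the two-option behavior looks like:

```python
import numpy as np

def unify_batch_format(batch: dict):
    """Hypothetical helper mirroring the behavior described in this PR:
    collapse a single-column batch to a bare array, and keep multi-column
    batches as a dict of per-column arrays."""
    if len(batch) == 1:
        # Simple/pure-tensor/single-column dataset: yield the lone tensor.
        return next(iter(batch.values()))
    # Multi-column dataset: yield a dict of column name -> tensor.
    return batch

single = unify_batch_format({"value": np.arange(4)})
multi = unify_batch_format({"x": np.arange(4), "y": np.ones(4)})
print(type(single).__name__)  # ndarray
print(sorted(multi.keys()))   # ['x', 'y']
```

In the real APIs the arrays would be converted to `torch.Tensor` or `tf.Tensor` respectively; NumPy stands in here only to show the format split.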

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@amogkam
Contributor

amogkam commented Jul 18, 2022

Can we consolidate with the implementation in TorchPredictor and TFPredictor? It would be great if we can make sure there is consistency between the two. The same dataset that works for training should also work for prediction and vice versa.

@clarkzinzow
Contributor Author

@amogkam Ah yep, I thought that the consistency guarantee would be trivial; I wasn't aware that we're unsqueezing unit-dimension tensors by default.

Could this consistency guarantee be satisfied by an integration test covering the main cases, or do you think we need to factor out the tensor conversion + batch formatting for DLPredictor/TorchPredictor/TFPredictor/etc. and .iter_torch_batches()/.iter_tf_batches() into shared code? The latter might be a bit ugly/involved.

@amogkam
Contributor

amogkam commented Jul 19, 2022

Ah yeah integration test would be great!

The refactoring hopefully shouldn't be too bad? I think it should just be a matter of extracting a common function. I'd be happy to pair on this with you tomorrow.
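The consistency check discussed above could look roughly like this (pure-NumPy stand-ins for the Torch/TF conversion; `shared_convert` is a hypothetical name for the common function, not Ray's actual code):

```python
import numpy as np

def shared_convert(batch: dict, convert_col):
    """Hypothetical shared conversion helper used by both the dataset
    iterator and the predictor, so their batch formats cannot drift."""
    if len(batch) == 1:
        return convert_col(next(iter(batch.values())))
    return {name: convert_col(col) for name, col in batch.items()}

def iter_training_batches(batches, convert_col):
    # Training side: the batch iterator funnels through the shared helper.
    for b in batches:
        yield shared_convert(b, convert_col)

def predictor_convert(batch, convert_col):
    # Prediction side: the predictor funnels through the same helper.
    return shared_convert(batch, convert_col)

# Framework conversion stubbed out; in Ray this would be e.g.
# torch.as_tensor or tf.convert_to_tensor.
convert = lambda a: np.asarray(a, dtype=np.float32)

raw = {"x": np.arange(3), "y": np.arange(3)}
train_batch = next(iter_training_batches([raw], convert))
pred_batch = predictor_convert(raw, convert)

# The same dataset batch yields identical formats and values on both paths.
assert train_batch.keys() == pred_batch.keys()
```

Because both paths call one function, an integration test over the single-column, multi-column, and unit-dimension cases would pin down the training/prediction consistency guarantee.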

@richardliaw
Contributor

richardliaw commented Jul 19, 2022

(just a reminder here) make sure this replaces all usage of to_torch and to_tf -- could be in another PR!

@clarkzinzow clarkzinzow force-pushed the datasets/feat/iter-ml-tensor-batches branch from e3a38b5 to cf8ff65 Compare July 19, 2022 20:06
@clarkzinzow
Contributor Author

@amogkam Refactored the NumPy batch --> DL tensor batch conversion a bit so we can share as much conversion logic as possible between training and prediction, PTAL!

Contributor

@amogkam amogkam left a comment


Thanks @clarkzinzow, the predictor changes lgtm!

@clarkzinzow clarkzinzow requested a review from c21 July 19, 2022 21:48
Contributor

@c21 c21 left a comment


LGTM

@clarkzinzow clarkzinzow force-pushed the datasets/feat/iter-ml-tensor-batches branch 2 times, most recently from 700bc65 to b69adcd Compare July 20, 2022 19:51

This iterator will yield single-tensor batches if the underlying dataset
consists of a single column; otherwise, it will yield a dictionary of
column-tensors. If looking for more flexibility in the tensor conversion (e.g.
Contributor

Please remove the mention of to_torch (we should also mark it as @Deprecated), here and in other docs. Here are mentions we should change:

source/ray-air/check-ingest.rst
43:typically calls ``iter_batches``, ``to_tf``, or ``to_torch`` to iterate over the dataset reader retrieved by ``get_dataset_shard``.

source/ray-air/examples/torch_incremental_learning.ipynb
544:    "    dataset_shard = session.get_dataset_shard(\"train\").to_torch(\n",

source/ray-air/examples/torch_image_example.ipynb
145:    "Next, let's represent our data using pandas dataframes instead of tuples. This lets us call methods like {py:meth}`Dataset.to_torch <ray.data.Dataset.to_torch>` later in the tutorial."
283:    "* We call {py:func}`session.get_dataset_shard <ray.air.session.get_dataset_shard>` and {py:meth}`Dataset.to_torch <ray.data.Dataset.to_torch>` to convert a subset of our training data to a Torch dataset.\n",
305:    "    train_dataset_shard: torch.utils.data.Dataset = session.get_dataset_shard(\"train\").to_torch(\n",

source/train/user_guide.rst
454:        train_torch_dataset = train_data_shard.to_torch(
459:        validation_torch_dataset = validation_data_shard.to_torch(
1203:            .to_torch(batch_size=config["batch_size"])

source/data/accessing-datasets.rst
66:For ingestion into one or more Torch trainers, Datasets offers a :meth:`ds.to_torch()
67:<ray.data.Dataset.to_torch>` API that returns a
78:  operations in conjunction with :meth:`ds.to_torch() <ray.data.Dataset.to_torch>`
89:that we may want to split into separate tensors. By informing ``ds.to_torch()`` of the
110:each tensor). See the :meth:`.to_torch() API docs <ray.data.Dataset.to_torch>` for

source/data/pipelining-compute.rst
77:   Alternatively, you may consider local shuffle after converting to_tf() or to_torch(), if simple shuffle is sufficient.

source/data/dataset-tensor-support.rst
189::meth:`ds.to_torch() <ray.data.Dataset.to_torch>` and
219:    torch_ds = ds.to_torch(
269:``to_torch()`` a ``feature_columns=[["feature_1"], ["feature_2"]]`` argument in order to
271:``to_torch()``, if isolating single columns as in the ``"feature_1"`` + ``"feature_2"``
295:    torch_ds = ds.to_torch(

source/data/key-concepts.rst
51:A DatasetPipeline is an unified iterator over a (potentially infinite) sequence of Ray Datasets, each of which represents a *window* over the original data. Conceptually it is similar to a `Spark DStream <https://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams>`__, but manages execution over a bounded amount of source data instead of an unbounded stream. Ray computes each dataset window on-demand and stitches their output together into a single logical data iterator. DatasetPipeline implements most of the same transformation and output methods as Datasets (e.g., map, filter, split, iter_rows, to_torch, etc.).

source/data/advanced-pipelines.rst
91:        for batch in pipe.to_torch():

source/data/dataset.rst
287:     - :meth:`ds.to_torch() <ray.data.Dataset.to_torch>`

source/data/memory-management.rst
71::meth:`ds.to_torch() <ray.data.Dataset.to_torch>`, etc.) or if

source/data/doc_code/key_concepts.py
75:    .to_torch()

source/data/doc_code/accessing_datasets.py
95:torch_ds: torch.utils.data.IterableDataset = ds.to_torch(batch_size=2)
125:torch_ds: torch.utils.data.IterableDataset = ds.to_torch(
221:        for batch in shard.to_torch(batch_size=256):

source/ray-core/_examples/datasets_train/datasets_train.py
290:   .to_torch_dataset()
418:    test_torch_dataset = test_dataset.to_torch(
440:        train_torch_dataset = train_dataset.to_torch(

Contributor Author

@clarkzinzow clarkzinzow Jul 21, 2022


Agreed with this, but to keep this PR a manageable size I'm going to do this in a follow-up PR; I will remove mentions of .to_torch() and .to_tf() from the .iter_torch_batches() and .iter_tf_batches() docstrings in this PR, though.

else:
    # Multi-tensor case.
    batch = {
        col_name: convert_ndarray_to_tf_tensor(
Contributor

Should we still auto-unsqueeze in this case?
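The "auto-unsqueeze" behavior in question adds a trailing unit dimension to 1-D per-column arrays, as `to_torch` does for column tensors. In NumPy terms (a stand-in for the Torch/TF behavior, with an illustrative helper name):

```python
import numpy as np

def maybe_unsqueeze(col: np.ndarray, unsqueeze: bool = True) -> np.ndarray:
    """Sketch of the auto-unsqueeze discussed here: a 1-D column of length
    N becomes shape (N, 1), so each column tensor has an explicit feature
    dimension (NumPy stand-in, not the actual Ray code)."""
    if unsqueeze and col.ndim == 1:
        return np.expand_dims(col, axis=1)
    return col

col = np.arange(4)
print(maybe_unsqueeze(col).shape)         # (4, 1)
print(maybe_unsqueeze(col, False).shape)  # (4,)
```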

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 20, 2022
Contributor

@c21 c21 left a comment


LG from my side.

@clarkzinzow
Contributor Author

@ericl @jianoaix Ping for code owner approval!


    Returns: A TensorFlow Tensor.
    """
    return tf.convert_to_tensor(ndarray, dtype=dtype)
Contributor

Can we just call it directly?

Contributor

@ericl ericl left a comment


Lgtm, but we could remove one of the util methods now.

@clarkzinzow
Contributor Author

@ericl Sounds good, I'll remove those!

torch.Size([4, 1])
torch.Size([4, 1])

Time complexity: O(1)
Contributor

There is a loop in the implementation; is this really O(1)? Are these time-complexity annotations actually useful?

@clarkzinzow clarkzinzow force-pushed the datasets/feat/iter-ml-tensor-batches branch from fcfb20d to 3d02aa6 Compare July 21, 2022 19:52
@clarkzinzow
Contributor Author

Test failures appear to be unrelated, merging.

@clarkzinzow clarkzinzow merged commit a29baf9 into ray-project:master Jul 22, 2022
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022