
[Datasets] Add .iter_torch_batches() and .iter_tf_batches() APIs. #26689

Conversation

clarkzinzow
Contributor

This PR adds .iter_torch_batches() and .iter_tf_batches() convenience APIs, which take care of ML framework tensor conversion, avoid the narrow-tensor waste of the underlying .iter_batches() call (by using the "numpy" batch format), and unify batch formats around two options: a single tensor for simple/pure-tensor/single-column datasets, and a dictionary of tensors for multi-column datasets.
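As a rough sketch of the batch-format unification described above (the helper name here is illustrative, not Ray's actual internals), the two-option behavior looks like:

```python
import numpy as np

def unify_batch_format(batch: dict):
    """Hypothetical helper mirroring the behavior described in this PR:
    collapse a single-column batch to a bare array, and keep multi-column
    batches as a dict of per-column arrays."""
    if len(batch) == 1:
        # Simple/pure-tensor/single-column dataset: yield the lone tensor.
        return next(iter(batch.values()))
    # Multi-column dataset: yield a dict of column name -> tensor.
    return batch

single = unify_batch_format({"value": np.arange(4)})
multi = unify_batch_format({"x": np.arange(4), "y": np.ones(4)})
print(type(single).__name__)  # ndarray
print(sorted(multi.keys()))   # ['x', 'y']
```

In the real APIs the arrays would be converted to `torch.Tensor` or `tf.Tensor` respectively; NumPy stands in here only to show the format split.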

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@amogkam
Contributor

amogkam commented Jul 18, 2022

Can we consolidate with the implementation in TorchPredictor and TFPredictor? It would be great if we can make sure there is consistency between the two. The same dataset that works for training should also work for prediction and vice versa.

@clarkzinzow
Contributor Author

@amogkam Ah yep, I thought that the consistency guarantee would be trivial; I wasn't aware that we're unsqueezing unit-dimension tensors by default.

Could this consistency guarantee be satisfied by an integration test covering the main cases, or do you think we need to factor out the tensor conversion + batch formatting for DLPredictor/TorchPredictor/TFPredictor/etc. and .iter_torch_batches()/.iter_tf_batches() into shared code? The latter might be a bit ugly/involved.

@amogkam
Contributor

amogkam commented Jul 19, 2022

Ah yeah integration test would be great!

The refactoring hopefully shouldn't be too bad? I think it should just be a matter of extracting a common function. I'd be happy to pair on this with you tomorrow.
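The consistency check discussed above could look roughly like this (pure-NumPy stand-ins for the Torch/TF conversion; `shared_convert` is a hypothetical name for the common function, not Ray's actual code):

```python
import numpy as np

def shared_convert(batch: dict, convert_col):
    """Hypothetical shared conversion helper used by both the dataset
    iterator and the predictor, so their batch formats cannot drift."""
    if len(batch) == 1:
        return convert_col(next(iter(batch.values())))
    return {name: convert_col(col) for name, col in batch.items()}

def iter_training_batches(batches, convert_col):
    # Training side: the batch iterator funnels through the shared helper.
    for b in batches:
        yield shared_convert(b, convert_col)

def predictor_convert(batch, convert_col):
    # Prediction side: the predictor funnels through the same helper.
    return shared_convert(batch, convert_col)

# Framework conversion stubbed out; in Ray this would be e.g.
# torch.as_tensor or tf.convert_to_tensor.
convert = lambda a: np.asarray(a, dtype=np.float32)

raw = {"x": np.arange(3), "y": np.arange(3)}
train_batch = next(iter_training_batches([raw], convert))
pred_batch = predictor_convert(raw, convert)

# The same dataset batch yields identical formats and values on both paths.
assert train_batch.keys() == pred_batch.keys()
```

Because both paths call one function, an integration test over the single-column, multi-column, and unit-dimension cases would pin down the training/prediction consistency guarantee.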

@richardliaw
Contributor

richardliaw commented Jul 19, 2022

(just a reminder here) make sure this replaces all usage of to_torch and to_tf -- could be in another PR!

@clarkzinzow clarkzinzow force-pushed the datasets/feat/iter-ml-tensor-batches branch from e3a38b5 to cf8ff65 Compare July 19, 2022 20:06
@clarkzinzow
Contributor Author

@amogkam Refactored the NumPy batch --> DL tensor batch conversion a bit so we can share as much conversion logic as possible between training and prediction, PTAL!

Contributor

@amogkam amogkam left a comment


Thanks @clarkzinzow, the predictor changes lgtm!

@clarkzinzow clarkzinzow requested a review from c21 July 19, 2022 21:48
Contributor

@c21 c21 left a comment


LGTM

@clarkzinzow clarkzinzow force-pushed the datasets/feat/iter-ml-tensor-batches branch 2 times, most recently from 700bc65 to b69adcd Compare July 20, 2022 19:51

This iterator will yield single-tensor batches if the underlying dataset
consists of a single column; otherwise, it will yield a dictionary of
column-tensors. If looking for more flexibility in the tensor conversion (e.g.
Contributor

Please remove the mention of to_torch (we should also mark it as @Deprecated), here and in other docs. Here are mentions we should change:

source/ray-air/check-ingest.rst
43:typically calls ``iter_batches``, ``to_tf``, or ``to_torch`` to iterate over the dataset reader retrieved by ``get_dataset_shard``.

source/ray-air/examples/torch_incremental_learning.ipynb
544:    "    dataset_shard = session.get_dataset_shard(\"train\").to_torch(\n",

source/ray-air/examples/torch_image_example.ipynb
145:    "Next, let's represent our data using pandas dataframes instead of tuples. This lets us call methods like {py:meth}`Dataset.to_torch <ray.data.Dataset.to_torch>` later in the tutorial."
283:    "* We call {py:func}`session.get_dataset_shard <ray.air.session.get_dataset_shard>` and {py:meth}`Dataset.to_torch <ray.data.Dataset.to_torch>` to convert a subset of our training data to a Torch dataset.\n",
305:    "    train_dataset_shard: torch.utils.data.Dataset = session.get_dataset_shard(\"train\").to_torch(\n",

source/train/user_guide.rst
454:        train_torch_dataset = train_data_shard.to_torch(
459:        validation_torch_dataset = validation_data_shard.to_torch(
1203:            .to_torch(batch_size=config["batch_size"])

source/data/accessing-datasets.rst
66:For ingestion into one or more Torch trainers, Datasets offers a :meth:`ds.to_torch()
67:<ray.data.Dataset.to_torch>` API that returns a
78:  operations in conjunction with :meth:`ds.to_torch() <ray.data.Dataset.to_torch>`
89:that we may want to split into separate tensors. By informing ``ds.to_torch()`` of the
110:each tensor). See the :meth:`.to_torch() API docs <ray.data.Dataset.to_torch>` for

source/data/pipelining-compute.rst
77:   Alternatively, you may consider local shuffle after converting to_tf() or to_torch(), if simple shuffle is sufficient.

source/data/dataset-tensor-support.rst
189::meth:`ds.to_torch() <ray.data.Dataset.to_torch>` and
219:    torch_ds = ds.to_torch(
269:``to_torch()`` a ``feature_columns=[["feature_1"], ["feature_2"]]`` argument in order to
271:``to_torch()``, if isolating single columns as in the ``"feature_1"`` + ``"feature_2"``
295:    torch_ds = ds.to_torch(

source/data/key-concepts.rst
51:A DatasetPipeline is an unified iterator over a (potentially infinite) sequence of Ray Datasets, each of which represents a *window* over the original data. Conceptually it is similar to a `Spark DStream <https://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams>`__, but manages execution over a bounded amount of source data instead of an unbounded stream. Ray computes each dataset window on-demand and stitches their output together into a single logical data iterator. DatasetPipeline implements most of the same transformation and output methods as Datasets (e.g., map, filter, split, iter_rows, to_torch, etc.).

source/data/advanced-pipelines.rst
91:        for batch in pipe.to_torch():

source/data/dataset.rst
287:     - :meth:`ds.to_torch() <ray.data.Dataset.to_torch>`

source/data/memory-management.rst
71::meth:`ds.to_torch() <ray.data.Dataset.to_torch>`, etc.) or if

source/data/doc_code/key_concepts.py
75:    .to_torch()

source/data/doc_code/accessing_datasets.py
95:torch_ds: torch.utils.data.IterableDataset = ds.to_torch(batch_size=2)
125:torch_ds: torch.utils.data.IterableDataset = ds.to_torch(
221:        for batch in shard.to_torch(batch_size=256):

source/ray-core/_examples/datasets_train/datasets_train.py
290:   .to_torch_dataset()
418:    test_torch_dataset = test_dataset.to_torch(
440:        train_torch_dataset = train_dataset.to_torch(

Contributor Author

@clarkzinzow clarkzinzow Jul 21, 2022


Agreed with this, but to keep this PR a manageable size I'm going to do this in a follow-up PR; I will remove mentions of .to_torch() and .to_tf() from the .iter_torch_batches() and .iter_tf_batches() docstrings in this PR, though.

else:
    # Multi-tensor case.
    batch = {
        col_name: convert_ndarray_to_tf_tensor(
Contributor

Should we still auto-unsqueeze in this case?
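The "auto-unsqueeze" behavior in question adds a trailing unit dimension to 1-D per-column arrays, as `to_torch` does for column tensors. In NumPy terms (a stand-in for the Torch/TF behavior, with an illustrative helper name):

```python
import numpy as np

def maybe_unsqueeze(col: np.ndarray, unsqueeze: bool = True) -> np.ndarray:
    """Sketch of the auto-unsqueeze discussed here: a 1-D column of length
    N becomes shape (N, 1), so each column tensor has an explicit feature
    dimension (NumPy stand-in, not the actual Ray code)."""
    if unsqueeze and col.ndim == 1:
        return np.expand_dims(col, axis=1)
    return col

col = np.arange(4)
print(maybe_unsqueeze(col).shape)         # (4, 1)
print(maybe_unsqueeze(col, False).shape)  # (4,)
```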

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 20, 2022
Contributor

@c21 c21 left a comment


LG from my side.

@clarkzinzow
Contributor Author

@ericl @jianoaix Ping for code owner approval!


    Returns: A TensorFlow Tensor.
    """
    return tf.convert_to_tensor(ndarray, dtype=dtype)
Contributor

Can we just call it directly?

Contributor

@ericl ericl left a comment


Lgtm, but we could remove one of the util methods now.

@clarkzinzow
Contributor Author

@ericl Sounds good, I'll remove those!

torch.Size([4, 1])
torch.Size([4, 1])

Time complexity: O(1)
Contributor

There is a loop in the implementation; is this really O(1)? Are these time-complexity annotations actually useful?

@clarkzinzow clarkzinzow force-pushed the datasets/feat/iter-ml-tensor-batches branch from fcfb20d to 3d02aa6 Compare July 21, 2022 19:52
@clarkzinzow
Contributor Author

Test failures appear to be unrelated, merging.

@clarkzinzow clarkzinzow merged commit a29baf9 into ray-project:master Jul 22, 2022
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022