[Datasets] Add .iter_torch_batches() and .iter_tf_batches() APIs. #26689
Conversation
Can we consolidate with the implementation in TorchPredictor and TFPredictor? It would be great if we can make sure there is consistency between the two. The same dataset that works for training should also work for prediction, and vice versa.
@amogkam Ah yep, I thought that the consistency guarantee would be trivial; I wasn't aware that we're unsqueezing unit-dimension tensors by default. Could this consistency guarantee be satisfied by an integration test covering the main cases, or do you think we need to factor out the tensor conversion + batch formatting?
Ah yeah, an integration test would be great! The refactoring hopefully shouldn't be too bad? I think it should just be a matter of extracting it out to a common function. I'd be happy to pair on this with you tomorrow.
(Just a reminder here) make sure this replaces all usage of to_torch and to_tf -- could be in another PR!
(force-pushed e3a38b5 to cf8ff65)
@amogkam Refactored the NumPy batch -> DL tensor batch conversion a bit so we can share as much conversion logic as possible between training and prediction, PTAL!
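(For context, a minimal sketch of what such a shared helper could look like; the function and parameter names here are hypothetical, not necessarily the ones the PR introduces.)

from typing import Dict, Optional, Union

import numpy as np
import torch


def convert_ndarray_batch_to_torch(
    batch: Union[np.ndarray, Dict[str, np.ndarray]],
    dtype: Optional[torch.dtype] = None,
    unsqueeze: bool = True,
) -> Union[torch.Tensor, Dict[str, torch.Tensor]]:
    # Shared NumPy -> torch conversion that both the dataset iterator and the
    # predictor could call, so training and prediction format batches
    # identically.
    def _to_tensor(ndarray: np.ndarray) -> torch.Tensor:
        tensor = torch.as_tensor(ndarray, dtype=dtype)
        # Optionally add a trailing unit dimension to 1-D columns, so a
        # (batch_size,) label column becomes (batch_size, 1).
        if unsqueeze and tensor.dim() == 1:
            tensor = tensor.unsqueeze(1)
        return tensor

    if isinstance(batch, dict):
        # Multi-column batch -> dict of column name to tensor.
        return {col: _to_tensor(arr) for col, arr in batch.items()}
    # Single-column batch -> single tensor.
    return _to_tensor(batch)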
Thanks @clarkzinzow, the predictor changes lgtm!
LGTM
(force-pushed 700bc65 to b69adcd)
This iterator will yield single-tensor batches if the underlying dataset
consists of a single column; otherwise, it will yield a dictionary of
column-tensors. If looking for more flexibility in the tensor conversion (e.g.
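(A usage sketch of the two batch shapes described above; exact keyword arguments and schema handling may differ from the final API.)

import ray

# Single-column dataset: each batch is a single torch.Tensor.
ds = ray.data.from_items([1.0, 2.0, 3.0, 4.0])
for batch in ds.iter_torch_batches(batch_size=4):
    print(batch.shape)

# Multi-column dataset: each batch is a dict of column name -> torch.Tensor.
ds = ray.data.from_items([{"feature": float(i), "label": i % 2} for i in range(4)])
for batch in ds.iter_torch_batches(batch_size=4):
    print(batch["feature"].shape, batch["label"].shape)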
Please remove the mention of to_torch (we should also mark it as @Deprecated), here and in other docs. Here are the mentions we should change (a deprecation sketch follows the list):
source/ray-air/check-ingest.rst
43:typically calls ``iter_batches``, ``to_tf``, or ``to_torch`` to iterate over the dataset reader retrieved by ``get_dataset_shard``.
source/ray-air/examples/torch_incremental_learning.ipynb
544: " dataset_shard = session.get_dataset_shard(\"train\").to_torch(\n",
source/ray-air/examples/torch_image_example.ipynb
145: "Next, let's represent our data using pandas dataframes instead of tuples. This lets us call methods like {py:meth}`Dataset.to_torch <ray.data.Dataset.to_torch>` later in the tutorial."
283: "* We call {py:func}`session.get_dataset_shard <ray.air.session.get_dataset_shard>` and {py:meth}`Dataset.to_torch <ray.data.Dataset.to_torch>` to convert a subset of our training data to a Torch dataset.\n",
305: " train_dataset_shard: torch.utils.data.Dataset = session.get_dataset_shard(\"train\").to_torch(\n",
source/train/user_guide.rst
454: train_torch_dataset = train_data_shard.to_torch(
459: validation_torch_dataset = validation_data_shard.to_torch(
1203: .to_torch(batch_size=config["batch_size"])
source/data/accessing-datasets.rst
66:For ingestion into one or more Torch trainers, Datasets offers a :meth:`ds.to_torch()
67:<ray.data.Dataset.to_torch>` API that returns a
78: operations in conjunction with :meth:`ds.to_torch() <ray.data.Dataset.to_torch>`
89:that we may want to split into separate tensors. By informing ``ds.to_torch()`` of the
110:each tensor). See the :meth:`.to_torch() API docs <ray.data.Dataset.to_torch>` for
source/data/pipelining-compute.rst
77: Alternatively, you may consider local shuffle after converting to_tf() or to_torch(), if simple shuffle is sufficient.
source/data/dataset-tensor-support.rst
189::meth:`ds.to_torch() <ray.data.Dataset.to_torch>` and
219: torch_ds = ds.to_torch(
269:``to_torch()`` a ``feature_columns=[["feature_1"], ["feature_2"]]`` argument in order to
271:``to_torch()``, if isolating single columns as in the ``"feature_1"`` + ``"feature_2"``
295: torch_ds = ds.to_torch(
source/data/key-concepts.rst
51:A DatasetPipeline is an unified iterator over a (potentially infinite) sequence of Ray Datasets, each of which represents a *window* over the original data. Conceptually it is similar to a `Spark DStream <https://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams>`__, but manages execution over a bounded amount of source data instead of an unbounded stream. Ray computes each dataset window on-demand and stitches their output together into a single logical data iterator. DatasetPipeline implements most of the same transformation and output methods as Datasets (e.g., map, filter, split, iter_rows, to_torch, etc.).
source/data/advanced-pipelines.rst
91: for batch in pipe.to_torch():
source/data/dataset.rst
287: - :meth:`ds.to_torch() <ray.data.Dataset.to_torch>`
source/data/memory-management.rst
71::meth:`ds.to_torch() <ray.data.Dataset.to_torch>`, etc.) or if
source/data/doc_code/key_concepts.py
75: .to_torch()
source/data/doc_code/accessing_datasets.py
95:torch_ds: torch.utils.data.IterableDataset = ds.to_torch(batch_size=2)
125:torch_ds: torch.utils.data.IterableDataset = ds.to_torch(
221: for batch in shard.to_torch(batch_size=256):
source/ray-core/_examples/datasets_train/datasets_train.py
290: .to_torch_dataset()
418: test_torch_dataset = test_dataset.to_torch(
440: train_torch_dataset = train_dataset.to_torch(
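(A sketch of the deprecation marking, assuming Ray's Deprecated annotation from ray.util.annotations; the message text is illustrative.)

from ray.util.annotations import Deprecated


class Dataset:
    @Deprecated(message="Use iter_torch_batches() instead.")
    def to_torch(self, *args, **kwargs):
        # Existing implementation unchanged; the annotation marks the API
        # as deprecated in the generated docs.
        ...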
Agreed with this, but to keep this PR a manageable size I'm going to do this in a follow-up PR; I will remove mentions of .to_torch() and .to_tf() from the .iter_torch_batches() and .iter_tf_batches() docstrings in this PR, though.
else:
    # Multi-tensor case.
    batch = {
        col_name: convert_ndarray_to_tf_tensor(
Should we still auto-unsqueeze in this case?
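(To illustrate what auto-unsqueezing means here: a 1-D column of shape (batch_size,) would gain a trailing unit dimension. A sketch:)

import numpy as np
import tensorflow as tf

col = np.array([1.0, 2.0, 3.0, 4.0])  # shape (4,)
tensor = tf.convert_to_tensor(col)    # shape (4,)

# Auto-unsqueezing would turn each scalar column into a (batch_size, 1)
# feature column:
unsqueezed = tf.expand_dims(tensor, axis=-1)  # shape (4, 1)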
cc @amogkam
LG from my side.
(force-pushed 5311506 to 85a8f42)
    Returns: A TensorFlow Tensor.
    """
    return tf.convert_to_tensor(ndarray, dtype=dtype)
Can we just call it directly?
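(i.e., drop the one-line wrapper and invoke TensorFlow at the call site; a sketch:)

import numpy as np
import tensorflow as tf

ndarray = np.arange(4, dtype=np.float32)

# Rather than routing through convert_ndarray_to_tf_tensor(), the call site
# can convert directly:
batch = tf.convert_to_tensor(ndarray, dtype=tf.float32)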
LGTM, but we could remove one of the util methods now.
@ericl Sounds good, I'll remove those!
torch.Size([4, 1])
torch.Size([4, 1])

Time complexity: O(1)
There is a loop in the implementation; is this really O(1)? Are these time complexity annotations actually useful?
(force-pushed fcfb20d to 3d02aa6)
Test failures appear to be unrelated, merging.
This PR adds .iter_torch_batches() and .iter_tf_batches() convenience APIs, which take care of ML framework tensor conversion, use the "numpy" batch format as the narrow tensor waist for the underlying .iter_batches() call, and unify batch formats around two options: a single tensor for simple/pure-tensor/single-column datasets, and a dictionary of tensors for multi-column datasets.
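A quick illustration of the TF variant (a sketch; exact keyword arguments may differ):

import ray

ds = ray.data.from_items([{"feature": float(i), "label": i % 2} for i in range(8)])

# Multi-column datasets yield a dict of column name -> tf.Tensor per batch.
for batch in ds.iter_tf_batches(batch_size=4):
    features, labels = batch["feature"], batch["label"]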
Checks

- I've run scripts/format.sh to lint the changes in this PR.