[Datasets] [WIP] Prototype wrapping operation-based lazy compute model. #22268
This PR contains a rough prototype of a lazy compute model for Datasets, implemented by wrapping the existing eagerly computed Datasets abstractions with lazy proxies. After switching to lazy mode (via a `ds.lazy()` call or a `lazy=True` arg to any dataset-creating API), all future operations are lazy, building up an operation graph that isn't executed until `.compute()` or a consuming method (e.g. `.show()`, `.iter_batches()`, `.to_torch()`, etc.) is called.

I'm pushing this PR up for reference; we will almost surely go with the operation-based lazy compute model in #22233, since it provides a clearer path to implementing (1) move semantics for blocks within operations, and (2) optimizations such as task fusion.
The big pros of this PR are:

- `Dataset`, `DatasetPipeline`, and `GroupedDataset` are essentially untouched; the only change to those abstractions is the addition of the `.lazy()` API. The fact that this prototype doesn't touch the core Datasets abstractions makes it very easy to add and later remove.
- `.split()` and `.split_at_indices()` are still lazy, and support automatic caching of the materialized pre-split dataset (pipeline) and triggering of execution by any of the consumers. This is done by parking the pre-split dataset (pipeline) in an actor, where all downstream operations of the split reference a split fetch to the actor; the first such fetch triggers execution of the dataset on the actor, after which the dataset is cached (see the sketch after this list). The lifetime of this actor is tied to the lifetime of the underlying dataset, so the cached dataset should be released once it's no longer needed by any of the splits.
- A `ds.cache()` method, which can be used before a branching computation, even if the branches are then passed to Ray tasks.

Hopefully we can pull a subset of these pros that makes sense into the other prototype.
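To make the actor-based split caching concrete, here is a rough, hypothetical sketch of the idea; `_SplitCoordinator`, `fetch_split`, and the surrounding wiring are invented for illustration and are not the names or code used in this PR.

```python
import ray

# Hypothetical sketch of the actor-based split caching described above.
# `_SplitCoordinator` is an invented name, not the prototype's actual class.
@ray.remote
class _SplitCoordinator:
    def __init__(self, build_dataset_fn, n_splits):
        self._build = build_dataset_fn  # lazy "recipe" for the pre-split dataset
        self._n_splits = n_splits
        self._splits = None             # cache of the materialized splits

    def fetch_split(self, i):
        # The first fetch triggers execution and caches the splits;
        # later fetches return the cached result.
        if self._splits is None:
            ds = self._build()
            self._splits = ds.split(self._n_splits)
        return self._splits[i]


# Downstream consumers reference split fetches against the actor; whichever
# consumer runs first triggers materialization of the pre-split dataset.
coordinator = _SplitCoordinator.remote(lambda: ray.data.range(100), 2)
left = ray.get(coordinator.fetch_split.remote(0))
right = ray.get(coordinator.fetch_split.remote(1))
print(left.count(), right.count())
```

Tying the actor's lifetime to the dataset (so the cache is released once no split needs it) is handled by the prototype itself and is not shown in this sketch.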