
[Datasets] Simplify to_tf interface #29028

Merged
16 commits merged into ray-project:master on Oct 7, 2022

Conversation

@bveeramani (Member) commented on Oct 4, 2022

Signed-off-by: Balaji Veeramani [email protected]

Why are these changes needed?

The TensorFlow UX is poor: to_tf has confusing semantics, and iter_tf_batches requires too much boilerplate.

  1. to_tf automatically concatenates and unsqueezes columns for the user. The original motivation was to let users convert tabular datasets to TensorFlow datasets, but the semantics are complicated, and it often doesn't behave the way users expect. More importantly, the concatenation and unsqueezing functionality is obsolete now that preprocessors.Concatenator exists.

  2. The currently recommended iter_tf_batches requires lots of boilerplate: you need to specify the output signature, call prepare_dataset_shard, and yield tensors from iter_tf_batches. We can improve the UX by calling prepare_dataset_shard on behalf of the user and inferring the output_signature from the dataset schema.

Before

# Imports needed to run this snippet (not shown in the original description).
import tensorflow as tf
from ray.train.tensorflow import prepare_dataset_shard

def to_tf_dataset(dataset, batch_size):
    def to_tensor_iterator():
        data_iterator = dataset.iter_tf_batches(
            batch_size=batch_size, dtypes=tf.float32
        )
        for d in data_iterator:
            # "concat_out" is the output column of the Concatenator.
            yield d["concat_out"], d["target"]

    # The output signature must be spelled out by hand; num_features is the
    # width of the concatenated "concat_out" column.
    output_signature = (
        tf.TensorSpec(shape=(None, num_features), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.float32),
    )
    tf_dataset = tf.data.Dataset.from_generator(
        to_tensor_iterator, output_signature=output_signature
    )
    return prepare_dataset_shard(tf_dataset)

to_tf_dataset(dataset, batch_size=64)

After

dataset.to_tf("concat_out", "target", batch_size=64)
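
For context, here is a minimal end-to-end sketch of the new flow (not taken from this PR). The toy dataset, its feature column names, and the Concatenator arguments are illustrative assumptions; only the to_tf call mirrors the example above.

import ray
from ray.data.preprocessors import Concatenator

# Hypothetical tabular dataset: two numeric feature columns plus a "target" label.
ds = ray.data.from_items(
    [{"sepal_length": 5.1, "sepal_width": 3.5, "target": 0.0}] * 8
)

# Concatenator packs the feature columns into a single "concat_out" column,
# replacing the concatenation that to_tf previously performed implicitly.
ds = Concatenator(exclude=["target"]).fit_transform(ds)

# Per the PR description, to_tf infers the output signature from the dataset
# schema, so no hand-written TensorSpec or explicit prepare_dataset_shard
# call is needed.
tf_dataset = ds.to_tf("concat_out", "target", batch_size=64)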

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -0,0 +1,140 @@
import pytest
Contributor

Add this to the BUILD file.

@bveeramani (Member, Author) replied on Oct 4, 2022

Looks like test_dataset_tf is already running in CI. Maybe it's captured by this?

# The Ray Data BUILD rule the author is quoting; the glob matches
# tests/test_dataset_tf.py because it fits "tests/test_*.py" and is not excluded.
py_test_module_list(
    files = glob(
        include=["tests/test_*.py"],
        exclude=[
            "tests/test_preprocessors.py",
            "tests/test_dataset_formats.py",
        ],
    ),
    size = "large",
    tags = ["team:core", "exclusive"],
    deps = ["//:ray_lib", ":conftest"],
)

@clarkzinzow (Contributor) left a comment

LGTM overall, mostly small nits on error handling and test coverage.



@pytest.mark.parametrize("pipelined", [False, True])
def test_tensors_in_tables_to_tf_variable_shaped(ray_start_regular_shared, pipelined):
Contributor

Now that the variable-shaped tensor column (simple ragged tensors) PR is unreverted, could we add test coverage of variable-shaped tensors back? #29071

@matthewdeng added this to the Ray 2.1 milestone on Oct 7, 2022
@c21 added the Ray 2.1 label on Oct 7, 2022
@clarkzinzow (Contributor) left a comment

LGTM with the two changes given!

@clarkzinzow (Contributor) commented

All relevant tests are passing, merging!

@clarkzinzow merged commit eb3f554 into ray-project:master on Oct 7, 2022
amogkam pushed a commit that referenced this pull request Nov 3, 2022
Signed-off-by: Balaji [email protected]

to_tf is preferred over iter_tf_batches. For context, see #29028 (tl;dr: iter_tf_batches is too boilerplate-y).
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022

Signed-off-by: Weichen Xu <[email protected]>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…t#29462)

Signed-off-by: Balaji [email protected]

to_tf is preferred over iter_tf_batches. For context, see ray-project#29028 (tl;dr: iter_tf_batches is too boilerplate-y).

Signed-off-by: Weichen Xu <[email protected]>