
[Datasets] Simplify to_tf interface #29028

Merged
16 commits merged into ray-project:master on Oct 7, 2022

Conversation

@bveeramani (Member) commented on Oct 4, 2022

Signed-off-by: Balaji Veeramani [email protected]

Why are these changes needed?

The TensorFlow UX is poor: to_tf has confusing semantics, and iter_tf_batches requires too much boilerplate.

  1. to_tf automatically concatenates and unsqueezes columns for the user. The original motivation was to let users convert tabular datasets to TensorFlow datasets, but the semantics are complicated, and it often doesn't behave the way users expect. More importantly, the concatenation and unsqueezing functionality is obsolete now that preprocessors.Concatenator exists.

  2. The currently recommended iter_tf_batches requires lots of boilerplate: you need to specify the output signature, call prepare_dataset_shard, and yield tensors from iter_tf_batches. We can improve the UX by calling prepare_dataset_shard on behalf of the user and inferring the output_signature from the dataset schema.

Before

# Imports needed to run this snippet (not shown in the original description).
import tensorflow as tf
from ray.train.tensorflow import prepare_dataset_shard

def to_tf_dataset(dataset, batch_size):
    def to_tensor_iterator():
        data_iterator = dataset.iter_tf_batches(
            batch_size=batch_size, dtypes=tf.float32
        )
        for d in data_iterator:
            # "concat_out" is the output column of the Concatenator.
            yield d["concat_out"], d["target"]

    # The output signature must be spelled out by hand; num_features is the
    # width of the concatenated "concat_out" column.
    output_signature = (
        tf.TensorSpec(shape=(None, num_features), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.float32),
    )
    tf_dataset = tf.data.Dataset.from_generator(
        to_tensor_iterator, output_signature=output_signature
    )
    return prepare_dataset_shard(tf_dataset)

to_tf_dataset(dataset, batch_size=64)

After

dataset.to_tf("concat_out", "target", batch_size=64)
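
For context, here is a minimal end-to-end sketch of the new flow (not taken from this PR). The toy dataset, its feature column names, and the Concatenator arguments are illustrative assumptions; only the to_tf call mirrors the example above.

import ray
from ray.data.preprocessors import Concatenator

# Hypothetical tabular dataset: two numeric feature columns plus a "target" label.
ds = ray.data.from_items(
    [{"sepal_length": 5.1, "sepal_width": 3.5, "target": 0.0}] * 8
)

# Concatenator packs the feature columns into a single "concat_out" column,
# replacing the concatenation that to_tf previously performed implicitly.
ds = Concatenator(exclude=["target"]).fit_transform(ds)

# Per the PR description, to_tf infers the output signature from the dataset
# schema, so no hand-written TensorSpec or explicit prepare_dataset_shard
# call is needed.
tf_dataset = ds.to_tf("concat_out", "target", batch_size=64)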

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -0,0 +1,140 @@
import pytest
Contributor

Add this to the BUILD file.

@bveeramani (Member, Author) replied on Oct 4, 2022

Looks like test_dataset_tf is already running in CI. Maybe it's captured by this?

# The Ray Data BUILD rule the author is quoting; the glob matches
# tests/test_dataset_tf.py because it fits "tests/test_*.py" and is not excluded.
py_test_module_list(
    files = glob(
        include=["tests/test_*.py"],
        exclude=[
            "tests/test_preprocessors.py",
            "tests/test_dataset_formats.py",
        ],
    ),
    size = "large",
    tags = ["team:core", "exclusive"],
    deps = ["//:ray_lib", ":conftest"],
)

@clarkzinzow (Contributor) left a comment

LGTM overall, mostly small nits on error handling and test coverage.



@pytest.mark.parametrize("pipelined", [False, True])
def test_tensors_in_tables_to_tf_variable_shaped(ray_start_regular_shared, pipelined):
Contributor

Now that the variable-shaped tensor column (simple ragged tensors) PR is unreverted, could we add test coverage of variable-shaped tensors back? #29071

@matthewdeng added this to the Ray 2.1 milestone on Oct 7, 2022
@c21 added the Ray 2.1 label on Oct 7, 2022
@clarkzinzow (Contributor) left a comment

LGTM with the two changes given!

@clarkzinzow (Contributor) commented

All relevant tests are passing, merging!

@clarkzinzow merged commit eb3f554 into ray-project:master on Oct 7, 2022
amogkam pushed a commit that referenced this pull request Nov 3, 2022
Signed-off-by: Balaji [email protected]

to_tf is preferred over iter_tf_batches. For context, see #29028 (tl;dr: iter_tf_batches is too boilerplate-y).
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022

Signed-off-by: Weichen Xu <[email protected]>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…t#29462)

Signed-off-by: Balaji [email protected]

to_tf is preferred over iter_tf_batches. For context, see ray-project#29028 (tl;dr: iter_tf_batches is too boilerplate-y).

Signed-off-by: Weichen Xu <[email protected]>