[Datasets] Simplify to_tf interface #29028

Conversation
@@ -0,0 +1,140 @@
import pytest
Add this to the BUILD file.
Looks like `test_dataset_tf` is already running in CI. Maybe it's captured by this?
Lines 31 to 42 in 1bd3f94:

py_test_module_list(
    files = glob(
        include=["tests/test_*.py"],
        exclude=[
            "tests/test_preprocessors.py",
            "tests/test_dataset_formats.py",
        ],
    ),
    size = "large",
    tags = ["team:core", "exclusive"],
    deps = ["//:ray_lib", ":conftest"],
)
LGTM overall, mostly small nits on error handling and test coverage.
@pytest.mark.parametrize("pipelined", [False, True])
def test_tensors_in_tables_to_tf_variable_shaped(ray_start_regular_shared, pipelined):
Now that the variable-shaped tensor column (simple ragged tensors) PR is unreverted, could we add back test coverage for variable-shaped tensors? #29071
Discussed offline. We'll do this in follow-up PRs.
Co-authored-by: Clark Zinzow <[email protected]> Signed-off-by: Balaji Veeramani <[email protected]>
LGTM with the two changes given!
Signed-off-by: Clark Zinzow <[email protected]>
All relevant tests are passing, merging!
Follow-up (#29462): to_tf is preferred over iter_tf_batches. For context, see #29028 (tl;dr: iter_tf_batches is too boilerplate-y).
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Why are these changes needed?

The TensorFlow UX is bad because `to_tf` is confusing and `iter_tf_batches` is boilerplate-y.

`to_tf` automatically concatenates and unsqueezes columns for the user. The original motivation was to let users convert tabular datasets to TensorFlow datasets. But the semantics are complicated, and it often doesn't work the way you want it to. More importantly, the concatenation and unsqueezing functionality is obsolete with the introduction of `preprocessors.Concatenator`.
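For illustration only (not from this PR), a rough sketch of expressing that concatenation explicitly with the `Concatenator` preprocessor. The column names and the `exclude`/`output_column_name` parameters are assumptions; check the `Concatenator` signature in your Ray version.

```python
# Hypothetical sketch: concatenate feature columns explicitly with the
# Concatenator preprocessor instead of relying on to_tf's implicit behavior.
# Column names and parameter names here are assumptions, not from the PR.
import ray
from ray.data.preprocessors import Concatenator

ds = ray.data.from_items(
    [{"sepal_length": 5.1, "sepal_width": 3.5, "label": 0.0}]
)

# Combine every column except "label" into a single "features" column.
preprocessor = Concatenator(exclude=["label"], output_column_name="features")
ds = preprocessor.fit_transform(ds)
```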
The currently recommended `iter_tf_batches` requires lots of boilerplate. You need to specify the output signature, call `prepare_dataset_shard`, and yield tensors from `iter_tf_batches`. We can improve the UX by eliminating the need to call `prepare_dataset_shard` and by inferring the `output_signature` from the dataset schema.

Before
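As a hedged illustration of the boilerplate described above (not the PR's actual "Before" snippet), a minimal sketch assuming a dataset shard with a two-feature `features` column and a `label` column; the column names and tensor shapes are assumptions.

```python
# Rough sketch of the pre-change pattern: hand-written generator, hand-written
# output signature, and an explicit prepare_dataset_shard call.
import tensorflow as tf
from ray.train.tensorflow import prepare_dataset_shard


def to_tf_dataset(dataset, batch_size):
    def generator():
        # iter_tf_batches yields dicts of column name -> tf.Tensor.
        for batch in dataset.iter_tf_batches(batch_size=batch_size):
            yield batch["features"], batch["label"]

    # The output signature must be spelled out by hand.
    output_signature = (
        tf.TensorSpec(shape=(None, 2), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.float32),
    )
    tf_dataset = tf.data.Dataset.from_generator(
        generator, output_signature=output_signature
    )
    return prepare_dataset_shard(tf_dataset)
```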
After
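And a sketch of the simplified call after this change, assuming the new `feature_columns`/`label_columns` keyword arguments; the exact signature lives in the PR diff, so treat this as an approximation rather than the PR's actual "After" snippet.

```python
# Rough sketch of the post-change pattern: to_tf infers the output signature
# from the dataset schema and handles prepare_dataset_shard internally.
tf_dataset = dataset.to_tf(
    feature_columns="features",
    label_columns="label",
    batch_size=batch_size,
)
```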
Related issue number
Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.