-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Datasets] Add from_tf
#29591
[Datasets] Add from_tf
#29591
Conversation
@@ -1326,6 +1327,58 @@ def convert(ds: "datasets.Dataset") -> Dataset[ArrowRow]: | |||
) | |||
|
|||
|
|||
@PublicAPI | |||
def from_tf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we expand this?
def from_tf( | |
def from_tensorflow( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a method called Dataset.to_tf
. I prefer tensorflow
over tf
, but we should be consistent.
What do you think? We could always rename Dataset.to_tf
to Dataset.to_tensorflow
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right... Maybe we can:
- Name this
from_tensorflow
. - Create a
to_tensorflow
. - Have
to_tf
redirect toto_tensorflow
and eventually deprecate it.
But will leave to folks like @clarkzinzow @c21 to make the call here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd actually vote for renaming these to from_tf_dataset
, to_tf_dataset
, from_torch_dataset
, to_torch_dataset
, etc. This is less ambiguous ("Am I getting a TensorFlow tensor or dataset with this API?"), discoverable, and better matches what the rest of the ecosystem is doing, e.g. HuggingFace and Petastorm.
I think that tf
is a common enough shorthand that we shouldn't expand it to tensorflow
, but happy to hear others' opinions, and maybe we can do a quick ecosystem check to verify this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -1326,6 +1327,58 @@ def convert(ds: "datasets.Dataset") -> Dataset[ArrowRow]: | |||
) | |||
|
|||
|
|||
@PublicAPI | |||
def from_tf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd actually vote for renaming these to from_tf_dataset
, to_tf_dataset
, from_torch_dataset
, to_torch_dataset
, etc. This is less ambiguous ("Am I getting a TensorFlow tensor or dataset with this API?"), discoverable, and better matches what the rest of the ecosystem is doing, e.g. HuggingFace and Petastorm.
I think that tf
is a common enough shorthand that we shouldn't expand it to tensorflow
, but happy to hear others' opinions, and maybe we can do a quick ecosystem check to verify this.
Returns: | ||
A :class:`Dataset` that contains the samples stored in the TensorFlow dataset. | ||
""" # noqa: E501 | ||
return from_items(list(dataset.as_numpy_iterator())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Misc. thing to note that I hit before in tests: .as_numpy_iterator()
doesn't work with ragged tensors. tensorflow/tensorflow#53149
Parity with from_huggingface. Also, SimpleTensorflowDatasource is clunky. For more motivation, read ray-project#29430. Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: Balaji [email protected]
Blocked on
Dataset.take
flattensArrowTensorType
tensors #29590Why are these changes needed?
Parity with
from_huggingface
. Also,SimpleTensorflowDatasource
is clunky. For more motivation, read #29430.Related issue number
Towards #29430. See also #29589.
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.