-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Datasets] Support tensor columns in to_tf
and to_torch
.
#24752
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2380,6 +2380,27 @@ def to_tf( | |
if isinstance(output_signature, list): | ||
output_signature = tuple(output_signature) | ||
|
||
def get_df_values(df: "pandas.DataFrame") -> np.ndarray: | ||
# TODO(Clark): Support unsqueezing column dimension API, similar to | ||
# to_torch(). | ||
try: | ||
clarkzinzow marked this conversation as resolved.
Show resolved
Hide resolved
|
||
values = df.values | ||
except ValueError as e: | ||
import pandas as pd | ||
|
||
# Pandas DataFrame.values doesn't support extension arrays in all | ||
# supported Pandas versions, so we check to see if this DataFrame | ||
# contains any extensions arrays and do a manual conversion if so. | ||
# See https://github.com/pandas-dev/pandas/pull/43160. | ||
if any( | ||
isinstance(dtype, pd.api.extensions.ExtensionDtype) | ||
for dtype in df.dtypes | ||
): | ||
values = np.stack([col.to_numpy() for _, col in df.items()], axis=1) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. hmm do we want to conver all columns to numpy if any column is an ExtensionDtype. Or should we only do col.to_numpy() only if the dtype of that particular columns is an ExtensionDtype There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It doesn't matter, |
||
else: | ||
raise e from None | ||
return values | ||
|
||
def make_generator(): | ||
for batch in self.iter_batches( | ||
prefetch_blocks=prefetch_blocks, | ||
|
@@ -2392,13 +2413,13 @@ def make_generator(): | |
|
||
features = None | ||
if feature_columns is None: | ||
features = batch.values | ||
features = get_df_values(batch) | ||
elif isinstance(feature_columns, list): | ||
if all(isinstance(column, str) for column in feature_columns): | ||
features = batch[feature_columns].values | ||
features = get_df_values(batch[feature_columns]) | ||
elif all(isinstance(columns, list) for columns in feature_columns): | ||
features = tuple( | ||
batch[columns].values for columns in feature_columns | ||
get_df_values(batch[columns]) for columns in feature_columns | ||
) | ||
else: | ||
raise ValueError( | ||
|
@@ -2407,7 +2428,7 @@ def make_generator(): | |
) | ||
elif isinstance(feature_columns, dict): | ||
features = { | ||
key: batch[columns].values | ||
key: get_df_values(batch[columns]) | ||
for key, columns in feature_columns.items() | ||
} | ||
else: | ||
|
@@ -2416,8 +2437,6 @@ def make_generator(): | |
f"but got a `{type(feature_columns).__name__}` instead." | ||
) | ||
|
||
# TODO(Clark): Support batches containing our extension array | ||
# TensorArray. | ||
if label_column: | ||
yield features, targets | ||
else: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 doc updates. Could we also add some explicit examples / text in the docs on how to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm document how to do what? This should now transparently work with tensor columns without any considerations by the user, so there wouldn't be anything special to document at exchange time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still not clear to me, and someone only passably familiar with the tensor code, how exactly I would use it. Could we have a runnable example showing it end to end with a tensor dataset?
Basically whatever's in the unit test, but in a more friendly example form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed on having an e2e example for tensor data being a good idea, and I'm about to work on creating an e2e example for tensor data as part of the Datasets GA docs work in a separate PR, but the point that I'm making here is that the user shouldn't have to do anything differently in the
.to_torch()
or.to_tf()
call just because their data contains tensor columns.The tensor column --> ML framework tensor conversion will happen automatically, without any tensor-column-specific args to
.to_torch()
or.to_tf()
, or special considerations for the upstream tensor column creation other than what's already documented in the tensor column guide: creating the tensor column in the first place. For.to_tf()
, the tensor spec of the columns needs to be given, but that is already a requirement for all columns.Since this example work is already planned for before GA, could we merge this in order to unblock the user PoC that depends on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the example I'm asking for is more in length than this thread already. Let's set a high standard for documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed up an example so we can resolve this. I'll have to rewrite/delete this example early next week once other work is merged, which is one of the reasons that I didn't want to block the user PoC on this example, so I'd like it if we could be a bit more pragmatic in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I don't see any harm in blocking this PR on the e2e example, in fact. We should be treating docs like unit tests, and not a separate "optional" component. There shouldn't really be a separate e2e example effort, for example.
I left a separate comment on making the example here more fleshed out (maybe this is copying work from your other effort).