-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
convert mixed column types to string #623
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
|
||
def test_mixed_types(): | ||
# TODO: datetime/str mixed types | ||
# don't work |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this TODO fixed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, it seems to require a bit more work, and I was wondering whether such mixed types could really appear in the wild
no, it seems to require a bit more work, and I was wondering whether such mixed types could really appear in the wild
ok, let's punt then
|
…ns with a slow pandas apply
skrub/_table_vectorizer.py
Outdated
@@ -477,6 +478,11 @@ def _auto_cast(self, X: pd.DataFrame) -> pd.DataFrame: | |||
X[col] = X[col].astype(np.float64) | |||
X[col].fillna(value=np.nan, inplace=True) | |||
|
|||
# if object, first convert to string to avoid mixed types | |||
for col in X.columns: | |||
if pd.api.types.is_object_dtype(X[col]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary, or can we just look at X.dtypes (at least in the case that X is a dataframe)? I suspect that X.dtypes is less costly than is_object_dtypes. It's also more of a first-level API and thus I suspect that it is more likely to generalize across dataframe implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Merging
(Sorry I think I messed up when applying a suggestion for #623 )
(Sorry I think I messed up when applying a suggestion for #623 )
Fix #622, which was due to mixed types column.