Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support category dtypes #24

Merged
merged 3 commits into from
May 8, 2019
Merged

support category dtypes #24

merged 3 commits into from
May 8, 2019

Conversation

caddac
Copy link
Contributor

@caddac caddac commented May 7, 2019

Resolves #22

@multimeric
Copy link
Owner

multimeric commented May 8, 2019

Hi David, thanks for identifying and fixing this issue. I have two suggestions on how I'd like this implemented.

Firstly, it seems that the more elegant way to do this broad type checking is with the pandas.api.types functions, as explained here: https://stackoverflow.com/a/38185759/2148718. These seem to have existed since 0.19 (2016), so I'm happy to bump the pandas version requirement and use them. This would, for example, give us access to pandas.api.types.is_categorical_dtype. This would also stop pandas throwing an error if you try to check a categorical column, and instead it would just return false, which is ideal.

Secondly, I think we could improve your actual skip-empty behaviour for categoricals:

validated = (series.astype(object).str.len() > 0) & simple_validation

Currently you're converting the column to a string and checking for empty strings, but I don't believe this is the idiomatic way to treat categoricals. In theory the user might actually have the empty string as one of their categories, and so we don't want to ignore a column in this case. On the other hand, the documentation says "All values of categorical data are either in categories or np.nan", which makes me think that, like with numericals, we should treat only np.nan as an empty categorical column.

@caddac
Copy link
Contributor Author

caddac commented May 8, 2019

Cool, will implement these changes.

For the dtype check, are you thinking we should use the pandas type checks for all checks? i.e.

        if is_categorical_dtype(series):
            validated = (series.astype(object).str.len() > 0) & simple_validation
        elif is_numeric_dtype(series):
            validated = ~series.isna() & simple_validation
        else:
            validated = (series.str.len() > 0) & simple_validation

or just for the categorical check?

@multimeric
Copy link
Owner

Might as well do it in both cases. But also make sure to bump the pandas version in the setup.py also.

@caddac
Copy link
Contributor Author

caddac commented May 8, 2019

I'm not seeing a pandas version in the setup.py. Seems it just installing the latest?

@multimeric
Copy link
Owner

Yes, exactly. Which is why it needs a constraint

pandas version 0.21 is required for Series.isna() and 0.19 is required for is_categorical_dtype and is_numeric_dtype
@caddac
Copy link
Contributor Author

caddac commented May 8, 2019

Allright, I think I've got this all wrapped up now. Ended up having to use pandas>=0.21 for the Series.isna() call which failed on pandas 0.19.

@multimeric
Copy link
Owner

Hmm, unless I use that method elsewhere, you might as well change it to isnull() (https://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.Series.isnull.html#pandas.Series.isnull), just so we can support the most versions of pandas possible.

@caddac
Copy link
Contributor Author

caddac commented May 8, 2019

OK, working with pandas 0.19!

@caddac
Copy link
Contributor Author

caddac commented May 8, 2019

For some reason the comments on this PR are completely out of order, I've submitted a ticket to Github. Very confusing...

@multimeric
Copy link
Owner

Excellent, looks good! I tweaked some of your docstrings, but nothing major.

For some reason the comments on this PR are completely out of order, I've submitted a ticket to Github. Very confusing...

Yep I also noticed that. Never seen it before...

@multimeric multimeric merged commit fe3171d into multimeric:master May 8, 2019
@multimeric
Copy link
Owner

This is now available as a GitHub release and on PyPi as 0.3.4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

get_errors() fails when Series.dtype=category
2 participants