API: Expand read_csv dtype for categoricals #14503

Open
TomAugspurger opened this issue Oct 26, 2016 · 2 comments
Labels
Categorical Categorical Data Type Enhancement IO CSV read_csv, to_csv

Comments

@TomAugspurger
Contributor

TomAugspurger commented Oct 26, 2016

In #13406 Chris added support for read_csv(..., dtype={'col': 'category'}) (thanks!). This issue is for expanding that syntax to allow a more complete specification of the resulting categorical.

# proposed: full specification, including ordering
df = pd.read_csv(path, dtype={'col': pd.Categorical(['a', 'b', 'c'], ordered=True)})
df = pd.read_csv(path, dtype={'col': ['a', 'b', 'c']})  # shorthand, but unordered only
# we would still accept `dtype={'col': 'category'}` as well, to infer categories

Implementation-wise, I think we can keep all the parsing logic as is, and simply loop over dtype and call set_categories (and maybe as_ordered) on all the categoricals just before returning to the user.
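
A minimal sketch of that post-processing approach, for illustration only. The helper name `read_csv_with_categories` and its `categories` and `ordered` parameters are hypothetical, not part of pandas; it parses each listed column as `'category'` and then applies `set_categories` before returning.

```python
import io
import pandas as pd

def read_csv_with_categories(source, categories, ordered=False, **kwargs):
    """Hypothetical helper: parse as 'category', then fix up the categories.

    `categories` maps a column name to the list of categories it should have.
    """
    dtype = dict(kwargs.pop("dtype", None) or {})
    dtype.update({col: "category" for col in categories})
    df = pd.read_csv(source, dtype=dtype, **kwargs)
    for col, cats in categories.items():
        # Values outside `cats` become NaN, which is what set_categories does.
        df[col] = df[col].cat.set_categories(cats, ordered=ordered)
    return df

csv_text = "col,x\na,1\nb,2\na,3\n"
df = read_csv_with_categories(
    io.StringIO(csv_text), {"col": ["a", "b", "c"]}, ordered=True
)
```

Here `'c'` never appears in the data but is still one of the resulting categories, which is the point of specifying them up front.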

This would help a bit in dask, where their category type inference can fail if the first partition doesn't contain all the categories (see dask/dask#1705). This is why it'd be preferable to do it as an option to read_csv, rather than putting it on the user to follow up with a set_categories.
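
To illustrate the partition problem, here is a small sketch that uses `chunksize` in plain pandas as a stand-in for dask partitions (an assumption for demonstration; dask's actual inference is not shown). Each chunk infers its own categories, so they disagree until a known list is applied.

```python
import io
import pandas as pd

csv_text = "col\na\na\nb\nc\n"

# Read in chunks of two rows, standing in for two partitions.
chunks = list(
    pd.read_csv(io.StringIO(csv_text), dtype={"col": "category"}, chunksize=2)
)

# Categories are inferred per chunk, so the chunks do not agree.
inferred = [list(chunk["col"].cat.categories) for chunk in chunks]

# Applying a known list afterwards gives every chunk the same categories.
known = ["a", "b", "c"]
fixed = [chunk["col"].cat.set_categories(known) for chunk in chunks]
```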

@TomAugspurger TomAugspurger added this to the 0.20.0 milestone Oct 26, 2016
@chris-b1
Contributor

chris-b1 commented Nov 7, 2016

If it matters, there would be at least a little performance to be picked up by modifying the parsing code - you could pass the categories in here:

cdef _categorical_convert(parser_t *parser, int col,

And shortcut building the categories array, sorting, etc. It would also cause an error to be thrown much earlier if the data has a value not in the specified categories.
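
A rough pure-Python sketch of that idea, not the Cython parser code: with the categories known in advance, codes can be looked up directly instead of collecting and sorting the unique values, and an unexpected value can be reported immediately. The function name `convert_with_known_categories` is hypothetical, and this sketch does not handle missing values.

```python
import numpy as np
import pandas as pd

def convert_with_known_categories(values, categories):
    """Hypothetical stand-in for a parser-level conversion with known categories."""
    index = pd.Index(categories)
    codes = index.get_indexer(values)  # -1 marks a value not in `categories`
    unknown = codes == -1
    if unknown.any():
        bad = sorted(set(np.asarray(values, dtype=object)[unknown]))
        raise ValueError(f"values not in categories: {bad}")
    # No uniquing or sorting step: the categories are used exactly as given.
    return pd.Categorical.from_codes(codes, categories=index)

cat = convert_with_known_categories(["a", "b", "a"], ["a", "b", "c"])
```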

@TomAugspurger
Contributor Author

Yes, I think you're right that it'd be better to do it in the Cython code.
I was digging through there last week, and it shouldn't be too much extra
effort.

Though I guess we'll need to have an API discussion about what to do if the
user passes dtype={'A': pd.Categorical(['a', 'b'])} and a value outside
those shows up.

  1. Throw an exception
  2. Set to NA

The second option would be consistent with set_categories.

In [1]: pd.Categorical(['a', 'b', 'c']).set_categories(['a', 'b'])
Out[1]:
[a, b, NaN]
Categories (2, object): [a, b]

On Mon, Nov 7, 2016 at 12:07 PM, chris-b1 wrote:

If it matters, there would be at least a little performance to be picked
up by modifying the parsing code - you could pass the categories in here:

https://github.com/pandas-dev/pandas/blob/7a2bcb6605bacea858ec14cfac424898deb568b3/pandas/parser.pyx#L1523

And shortcut building the categories array, sorting, etc. It would also
cause an error to be thrown much earlier if the data has a value not in the
specified categories.


