API: Expand read_csv dtype for categoricals #14503

TomAugspurger · 2016-10-26T16:47:34Z

In #13406 Chris added support for read_csv(..., dtype={'col': 'category'}) (thanks!). This issue is for expanding that syntax to allow a more complete specification of the resulting categorical.

import pandas as pd

# Proposed syntax
df = pd.read_csv(path, dtype={'col': pd.Categorical(['a', 'b', 'c'], ordered=True)})
df = pd.read_csv(path, dtype={'col': ['a', 'b', 'c']})  # shorthand, but unordered only
# we would still accept `dtype={'col': 'category'}` as well, to infer categories

Implementation-wise, I think we can keep all the parsing logic as is, and simply loop over dtype and call set_categories (and maybe as_ordered) on all the categoricals just before returning to the user.
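A minimal sketch of that post-processing idea, written as a wrapper around read_csv rather than as a change to pandas itself. The function name `read_csv_with_categories` and its `categories` / `ordered` parameters are hypothetical and exist only for illustration; the parsing is left untouched and the categoricals are adjusted just before returning.

```python
import io

import pandas as pd


def read_csv_with_categories(source, categories, ordered=(), **kwargs):
    """Hypothetical wrapper, not a pandas API.

    categories: dict mapping column name -> list of categories
    ordered: collection of column names whose categorical should be ordered
    """
    dtype = dict(kwargs.pop("dtype", None) or {})
    # Let the parser infer categories exactly as it does today ...
    dtype.update({col: "category" for col in categories})
    df = pd.read_csv(source, dtype=dtype, **kwargs)
    # ... then loop over the requested columns and fix them up before returning.
    for col, cats in categories.items():
        df[col] = df[col].cat.set_categories(cats, ordered=col in ordered)
    return df


csv_text = "col,x\nb,1\na,2\nb,3\n"
df = read_csv_with_categories(
    io.StringIO(csv_text), {"col": ["a", "b", "c"]}, ordered={"col"}
)
print(df["col"].cat.categories.tolist(), df["col"].cat.ordered)
```

Note that with this approach a value missing from the requested categories is silently turned into NaN by set_categories, and the parser still pays for inferring categories that are then discarded.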

This would help a bit in dask, where their category type inference can fail if the first partition doesn't contain all the categories (see dask/dask#1705). This is why it'd be preferable to do it as an option to read_csv, rather than putting it on the user to follow up with set_categories.
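The partition problem can be illustrated without dask. In the sketch below, plain pandas chunks stand in for dask partitions (an assumption made for illustration; dask's actual inference code is not shown): each chunk infers its own categories, so the first chunk alone is not representative.

```python
import io

import pandas as pd

data = "col\na\na\nb\nc\n"

# Each chunk of two rows plays the role of one partition.
chunks = list(
    pd.read_csv(io.StringIO(data), dtype={"col": "category"}, chunksize=2)
)

# Categories are inferred per chunk, so they disagree between chunks.
inferred = [chunk["col"].cat.categories.tolist() for chunk in chunks]
print(inferred)

# Concatenating categoricals with different categories loses the categorical dtype.
combined = pd.concat([chunk["col"] for chunk in chunks], ignore_index=True)
print(combined.dtype)

# Today's workaround: every chunk has to be fixed up by the user after parsing.
known = ["a", "b", "c"]
fixed = pd.concat(
    [chunk["col"].cat.set_categories(known) for chunk in chunks],
    ignore_index=True,
)
print(fixed.dtype)
```

Specifying the categories up front in read_csv would make every chunk agree without the extra step.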


chris-b1 · 2016-11-07T18:07:43Z

If it matters, there would be at least a little performance to be picked up by modifying the parsing code - you could pass the categories in here:

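pandas/parser.pyx, line 1523 at commit 7a2bcb6 (https://github.com/pandas-dev/pandas/blob/7a2bcb6605bacea858ec14cfac424898deb568b3/pandas/parser.pyx#L1523):

cdef _categorical_convert(parser_t *parser, int col,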

And shortcut building the categories array, sorting, etc. It would also cause an error to be thrown much earlier if the data has a value not in the specified categories.
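To make the shortcut concrete, here is a pure-Python sketch of encoding tokens against categories that are known in advance. It is not the Cython code in parser.pyx; the function name `encode_with_known_categories` is hypothetical. With a fixed category list there is no need to collect unique values or sort them, and an unknown value can be reported at the row where it is first seen.

```python
import pandas as pd


def encode_with_known_categories(values, categories, ordered=False):
    """Hypothetical helper: map raw tokens straight to integer codes."""
    # Build the lookup table once; no unique/sort pass over the data is needed.
    lookup = {cat: code for code, cat in enumerate(categories)}
    codes = []
    for row, value in enumerate(values):
        try:
            codes.append(lookup[value])
        except KeyError:
            # Fail as soon as an unexpected value appears.
            raise ValueError(
                f"row {row}: {value!r} is not in the given categories"
            ) from None
    return pd.Categorical.from_codes(codes, categories=categories, ordered=ordered)


cat = encode_with_known_categories(["b", "a", "c"], ["a", "b", "c"], ordered=True)
print(cat)
```

Whether an unknown value should raise, as it does in this sketch, is a separate API decision.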

TomAugspurger · 2016-11-08T01:30:30Z

Yes, I think you're right that it'd be better to do it in the Cython code. I was digging through there last week, and it shouldn't be too much extra effort.

Though I guess we'll need to have an API discussion about what to do if the user passes dtype={'A': pd.Categorical(['a', 'b'])} and a value outside those shows up:

1. Throw an exception
2. Set to NA

The second option would be consistent with set_categories.

In [1]: pd.Categorical(['a', 'b', 'c']).set_categories(['a', 'b'])
Out[1]:
[a, b, NaN]
Categories (2, object): [a, b]
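The two candidate behaviors can be compared side by side. This is only a sketch of the choice under discussion: the function `to_categorical` and its `on_unknown` parameter are hypothetical names, not a proposed pandas signature.

```python
import pandas as pd


def to_categorical(values, categories, on_unknown="na"):
    """Hypothetical helper illustrating the two policies for unknown values."""
    if on_unknown not in ("na", "raise"):
        raise ValueError("on_unknown must be 'na' or 'raise'")
    # Values outside `categories` get code -1, i.e. NaN (same as set_categories).
    result = pd.Categorical(values, categories=categories)
    if on_unknown == "raise":
        # A code of -1 on a non-missing input means the value was not a category.
        unknown = sorted(
            {v for v, code in zip(values, result.codes) if code == -1 and not pd.isna(v)}
        )
        if unknown:
            raise ValueError(f"values not in categories: {unknown}")
    return result


as_na = to_categorical(["a", "b", "c"], ["a", "b"], on_unknown="na")
print(as_na.codes.tolist())

try:
    to_categorical(["a", "b", "c"], ["a", "b"], on_unknown="raise")
except ValueError as exc:
    message = str(exc)
    print(message)
```

Setting to NA matches existing set_categories semantics, while raising surfaces bad data earlier.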


TomAugspurger added API Design IO CSV read_csv, to_csv Categorical Categorical Data Type Difficulty Intermediate labels Oct 26, 2016

TomAugspurger added this to the 0.20.0 milestone Oct 26, 2016

This was referenced Nov 17, 2016

API: Expand DataFrame.astype to allow Categorical(categories, ordered) #14676

Closed

[WIP]API: CategoricalType for specifying categoricals #14698

Closed

Make categories and ordered part of CategoricalDtype #14711

Closed

jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

jbrockmendel removed Difficulty Intermediate labels Oct 21, 2019

mroeschke added the Enhancement label Jun 28, 2020

mroeschke removed the API Design label May 2, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
