API: Expand DataFrame.astype to allow Categorical(categories, ordered) #14676

TomAugspurger · 2016-11-17T13:55:54Z

A small, complete example of the issue

This is a proposal to allow something like

df.astype({'A': pd.CategoricalDtype(['a', 'b', 'c', 'd'], ordered=True})

Currently, it can be awkward to convert many columns in a DataFrame to a Categorical with control over the categories and orderedness. If you just want to use the defaults, it's not so bad with .astype:

In [5]: df = pd.DataFrame({"A": list('abc'), 'B': list('def')})

In [6]: df
Out[6]:
   A  B
0  a  d
1  b  e
2  c  f

In [8]: df.astype({"A": 'category', 'B': 'category'}).dtypes
Out[8]:
A    category
B    category
dtype: object

If you need to control categories or ordered, you're best off with

In [20]: mapping = {'A': lambda x: x.A.astype('category').cat.set_categories(['a', 'b'], ordered=True),
    ...:            'B': lambda x: x.B.astype('category').cat.set_categories(['d', 'f', 'e'], ordered=False)}

In [21]: df.assign(**mapping)
Out[21]:
     A  B
0    a  d
1    b  e
2  NaN  f

By expanding astype to accept instances of Categorical, you remove the need for the lambdas and you can do conversions of other types at the same time.

This would mirror the semantics in #14503

Updated to change pd.Categorical(...) to a new/modified pd.CategoricalDtype(...) based on the discussion below.
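For readers landing here later, a minimal sketch of what the proposal looks like in practice, assuming a pandas version where `pd.CategoricalDtype` accepts `categories` and `ordered` (0.21.0 or newer, per the milestone on this issue). The column `C` and the category order chosen for `A` are illustrative additions, not part of the original example.

```python
import pandas as pd

# Same frame as above, plus a hypothetical numeric column "C" to show
# that non-categorical conversions can go in the same astype call.
df = pd.DataFrame({"A": list("abc"), "B": list("def"), "C": [1, 2, 3]})

dtypes = {
    # Custom category order with ordering enabled (illustrative choice).
    "A": pd.CategoricalDtype(["c", "b", "a"], ordered=True),
    # Custom category order, unordered.
    "B": pd.CategoricalDtype(["d", "f", "e"], ordered=False),
    # An ordinary dtype conversion in the same mapping.
    "C": "float64",
}

result = df.astype(dtypes)
print(result.dtypes)
print(result["A"].cat.categories)
```

The dict passed to `astype` replaces the per-column lambdas: each value is a plain dtype object, so it can be built once and reused across frames.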


jreback · 2016-11-17T14:34:01Z

so our current semantics are wrong on CategoricalDtype, IOW, we really should actually have the categories (and ordered) as part of the actual dtype, then that would be what you would pass here.

But since we don't support that ATM, passing an 'empty' Categorical is prob reasonable.

TomAugspurger · 2016-11-17T14:40:48Z

> But since we don't support that ATM, passing an 'empty' Categorical is prob reasonable.

Can you clarify that? Are you saying anywhere we currently allow dtype='category', we really mean dtype=pd.Categorical()? Because I don't think those are the same. The dtype='category' is on a higher level than any instance of the type, including an empty Categorical. I could be misunderstanding though.

jreback · 2016-11-17T15:51:58Z

what I am saying is that in reality categorical dtype is actually mis-defined (has always been); we have a single instance of it. In reality it should have the categories IN the dtype (and ordered). But that's not how it's actually set up. This is more theoretical, because it's sort of hard to change it now.

TomAugspurger · 2016-11-17T16:00:25Z

Mmm, I think I see what you're saying now. In other words, currently pd.Categorical(values, categories, ordered) is a value constructor. We want something like pd.CategoricalType(categories, ordered) as a type constructor (which is what I had in mind when writing up the initial issue). I'll do some thinking / researching on how other languages handle this.
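The value-constructor versus type-constructor distinction can be sketched as follows. This assumes the type constructor ended up being spelled `pd.CategoricalDtype(categories, ordered)` (the name used in the updated issue text) rather than the `pd.CategoricalType` name floated in this comment, and that the pandas version in use supports it.

```python
import pandas as pd

# Value constructor: builds an array of data that happens to be categorical.
values = pd.Categorical(["a", "b", "a"], categories=["a", "b"], ordered=True)

# Type constructor: describes the categories and orderedness with no data attached.
dtype = pd.CategoricalDtype(categories=["a", "b"], ordered=True)

# The array's dtype carries the same information as the standalone type.
same_type = values.dtype == dtype

# The type can be applied to other data without reference to `values`.
other = pd.Series(["b", "b", "a"]).astype(dtype)
print(same_type, other.dtype)
```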

jreback · 2016-11-17T16:07:51Z

right. It is possible to add a constructor like that to CategoricalDtype.

We extended the CategoricalDtype to accept optional categories and ordered arguments.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True)
```

CategoricalDtype is now part of the public API. This allows users to specify the desired categories and orderedness of an operation ahead of time. The current behavior, which is still possible with categories=None, the default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories that are known ahead of time in other places, e.g. .astype, .read_csv, and the Series constructor.

Closes pandas-dev#14711 Closes pandas-dev#15078 Closes pandas-dev#14676
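A short sketch of the "other places" mentioned above, assuming a pandas version in which the Series constructor and `read_csv` both accept a `CategoricalDtype` instance. The `size` column name and its categories are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical categories, known ahead of time.
size = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)

# Series constructor: categories come from the dtype, not from the data.
s = pd.Series(["small", "large"], dtype=size)

# categories=None (the default): categories are inferred from the values present.
inferred = pd.Series(["small", "large"], dtype=pd.CategoricalDtype())

# read_csv: the same dtype object applied per column while parsing.
csv = io.StringIO("size\nlarge\nsmall\n")
frame = pd.read_csv(csv, dtype={"size": size})

print(s.cat.categories)
print(inferred.cat.categories)
print(frame["size"].dtype)
```

Note that `s` keeps the unused `medium` category, while `inferred` only knows about the two values it saw.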


TomAugspurger added the API Design, Categorical, and Dtype Conversions labels Nov 17, 2016

TomAugspurger added this to the 0.20.0 milestone Nov 17, 2016

jreback changed the title ~~API: Expand DataFrame.astype to allow Categorical(categories, ordered)~~ API: Expand DataFrame.astype`to allow Categorical(categories, ordered) Nov 17, 2016

jreback changed the title ~~API: Expand DataFrame.astype`to allow Categorical(categories, ordered)~~ API: Expand DataFrame.astype to allow Categorical(categories, ordered) Nov 17, 2016

jreback added Difficulty Intermediate Enhancement labels Nov 17, 2016

This was referenced Nov 20, 2016

[WIP]API: CategoricalType for specifying categoricals #14698

Closed

Make categories and ordered part of CategoricalDtype #14711

Closed

jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

TomAugspurger mentioned this issue Apr 16, 2017

Categorical type #16015

Merged

jreback modified the milestones: Next Major Release, 0.21.0 Sep 23, 2017

jreback closed this as completed in #16015 Sep 23, 2017

jreback pushed a commit that referenced this issue Sep 23, 2017

Categorical type (#16015)

e57f189

Closes #14711 Closes #15078 Closes #14676

alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017

Categorical type (pandas-dev#16015)

da7ad15

Closes pandas-dev#14711 Closes pandas-dev#15078 Closes pandas-dev#14676

No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017

Categorical type (pandas-dev#16015)

3bb7929

Closes pandas-dev#14711 Closes pandas-dev#15078 Closes pandas-dev#14676
