API: Expand DataFrame.astype to allow Categorical(categories, ordered) #14676

Closed
TomAugspurger opened this issue Nov 17, 2016 · 5 comments
Labels: API Design, Categorical (Categorical Data Type), Dtype Conversions (Unexpected or buggy dtype conversions), Enhancement

@TomAugspurger
Contributor

TomAugspurger commented Nov 17, 2016

A small, complete example of the issue

This is a proposal to allow something like

df.astype({'A': pd.CategoricalDtype(['a', 'b', 'c', 'd'], ordered=True)})

Currently, it can be awkward to convert many columns in a DataFrame to a Categorical with control over the categories and orderedness. If you just want to use the defaults, it's not so bad with .astype:

In [5]: df = pd.DataFrame({"A": list('abc'), 'B': list('def')})

In [6]: df
Out[6]:
   A  B
0  a  d
1  b  e
2  c  f

In [8]: df.astype({"A": 'category', 'B': 'category'}).dtypes
Out[8]:
A    category
B    category
dtype: object

If you need to control categories or ordered, you're best off with

In [20]: mapping = {'A': lambda x: x.A.astype('category').cat.set_categories(['a', 'b'], ordered=True),
    ...:            'B': lambda x: x.B.astype('category').cat.set_categories(['d', 'f', 'e'], ordered=False)}

In [21]: df.assign(**mapping)
Out[21]:
     A  B
0    a  d
1    b  e
2  NaN  f

By expanding astype to accept instances of Categorical, you remove the need for the lambdas and you can do conversions of other types at the same time.
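As a rough sketch of how the proposal reads in use: the example below assumes a pandas version in which `pd.CategoricalDtype` accepts `categories` and `ordered` (the form this issue was later updated to propose). The column `C` and its `float64` conversion are additions made only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": list("abc"), "B": list("def"), "C": [1, 2, 3]})

# Describe each target categorical type up front, without any values.
dtype_a = pd.CategoricalDtype(["a", "b"], ordered=True)
dtype_b = pd.CategoricalDtype(["d", "f", "e"], ordered=False)

# One astype call handles the categorical columns and an unrelated
# numeric conversion together; no lambdas are needed.
converted = df.astype({"A": dtype_a, "B": dtype_b, "C": "float64"})

print(converted.dtypes)
print(converted["A"].cat.categories.tolist())
```

As with the `set_categories` version, values that are not among the declared categories (here `'c'` in column `A`) become missing.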

This would mirror the semantics in #14503

Updated to change pd.Categorical(...) to a new/modified pd.CategoricalDtype(...) based on the discussion below.

@TomAugspurger TomAugspurger added API Design Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions labels Nov 17, 2016
@TomAugspurger TomAugspurger added this to the 0.20.0 milestone Nov 17, 2016
@jreback jreback changed the title API: Expand DataFrame.astype to allow Categorical(categories, ordered) API: Expand DataFrame.astype`to allow Categorical(categories, ordered) Nov 17, 2016
@jreback jreback changed the title API: Expand DataFrame.astype`to allow Categorical(categories, ordered) API: Expand DataFrame.astype to allow Categorical(categories, ordered) Nov 17, 2016
@jreback
Contributor

jreback commented Nov 17, 2016

So our current semantics for CategoricalDtype are wrong. IOW, the categories (and ordered) really should be part of the dtype itself, and then that would be what you would pass here.

But since we don't support that ATM, passing an 'empty' Categorical is prob reasonable.

@TomAugspurger
Contributor Author

But since we don't support that ATM, passing an 'empty' Categorical is prob reasonable.

Can you clarify that? Are you saying anywhere we currently allow dtype='category', we really mean dtype=pd.Categorical()? Because I don't think those are the same. The dtype='category' is on a higher level than any instance of the type, including an empty Categorical. I could be misunderstanding though.

@jreback
Contributor

jreback commented Nov 17, 2016

What I am saying is that the categorical dtype is actually mis-defined (and always has been); we have a single instance of it. It should have the categories (and ordered) IN the dtype, but that's not how it's actually set up. This is more theoretical, because it's sort of hard to change now.
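To make the point concrete, here is a small sketch of what "categories in the dtype" means. It assumes a pandas version that ships the parameterized `pd.CategoricalDtype` described in the commit message later in this thread; it does not reflect the behavior at the time of this comment. The variable names are illustrative.

```python
import pandas as pd

small = pd.CategoricalDtype(["a", "b"], ordered=False)
large = pd.CategoricalDtype(["a", "b", "c"], ordered=False)
small_ordered = pd.CategoricalDtype(["a", "b"], ordered=True)

# Different categories or different orderedness give different dtypes.
print(small == large)
print(small == small_ordered)

# A Series converted with a dtype reports exactly that dtype back.
s = pd.Series(["a", "b", "a"]).astype(small)
print(s.dtype == small)
```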

@TomAugspurger
Contributor Author

Mmm, I think I see what you're saying now. In other words, currently pd.Categorical(values, categories, ordered) is a value constructor. We want something like pd.CategoricalType(categories, ordered) as a type constructor (which is what I had in mind when writing up the initial issue). I'll do some thinking / researching on how other languages handle this.
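A minimal sketch of the value-constructor versus type-constructor distinction, assuming a pandas version where the type constructor exists under the name `pd.CategoricalDtype` (the name used in the updated issue text, rather than `pd.CategoricalType`). The category labels are made up for illustration.

```python
import pandas as pd

categories = ["low", "mid", "high"]

# Value constructor: builds one concrete array of categorical data.
values = pd.Categorical(["low", "high", "low"], categories=categories, ordered=True)

# Type constructor: describes the type only, so it can be reused.
size_dtype = pd.CategoricalDtype(categories=categories, ordered=True)

first = pd.Series(["low", "high", "low"]).astype(size_dtype)
second = pd.Series(["mid", "mid"]).astype(size_dtype)

print(values.dtype == size_dtype)
print(first.dtype == second.dtype)
```

Because the type is described independently of any data, `second` carries all three categories even though only `"mid"` appears in its values.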

@jreback
Contributor

jreback commented Nov 17, 2016

right. It is possible to add a constructor like that to CategoricalDtype.

@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 25, 2017
We extended the CategoricalDtype to accept optional categories and ordered
arguments.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True)
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are known ahead of time in other places, e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
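The commit message names `.astype`, `.read_csv`, and the Series constructor as places where a predeclared dtype could be used. The sketch below assumes a pandas version in which those entry points accept a `CategoricalDtype`; the CSV text and column name are invented for the example.

```python
import io

import pandas as pd

dtype = pd.CategoricalDtype(categories=["a", "b"], ordered=True)

# Series constructor: values outside the declared categories become NaN.
s = pd.Series(["a", "b", "c"], dtype=dtype)

# read_csv: apply the same dtype to a column while parsing.
csv_text = "col\na\nb\na\n"
frame = pd.read_csv(io.StringIO(csv_text), dtype={"col": dtype})

# categories=None (the default) infers the categories from the data.
inferred = pd.Series(["b", "a", "b"]).astype(pd.CategoricalDtype())

print(s)
print(frame["col"].dtype)
print(inferred.cat.categories.tolist())
```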
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 30, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Aug 31, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 6, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 10, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 13, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 15, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 15, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 17, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 20, 2017
@jreback jreback modified the milestones: Next Major Release, 0.21.0 Sep 23, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 23, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Sep 23, 2017
jreback pushed a commit that referenced this issue Sep 23, 2017
alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017