ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
arguments.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True)
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are known ahead of time in other places, e.g. .astype, .read_csv, and the
Series constructor.
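
A minimal sketch of the difference between declared and inferred categories,
assuming `Series.astype` accepts a `CategoricalDtype` as in the docs changed by
this commit (the variable names are illustrative only):

```python
import pandas as pd

# Declare the categories and orderedness ahead of time.
dtype = pd.CategoricalDtype(categories=["a", "b"], ordered=True)
s = pd.Series(["a", "b", "a"]).astype(dtype)

# With the plain string, categories are inferred from the data
# and the result is unordered.
inferred = pd.Series(["a", "b", "a"]).astype("category")

print(s.cat.categories, s.cat.ordered)
print(inferred.cat.categories, inferred.cat.ordered)
```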

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
TomAugspurger committed Aug 25, 2017
1 parent 1abaecb commit a7eb835
Showing 21 changed files with 510 additions and 170 deletions.
2 changes: 1 addition & 1 deletion doc/source/advanced.rst
@@ -654,7 +654,7 @@ setting the index of a ``DataFrame/Series`` with a ``category`` dtype would conv
df = pd.DataFrame({'A': np.arange(6),
'B': list('aabbca')})
df['B'] = df['B'].astype('category', categories=list('cab'))
df['B'] = df['B'].astype(pd.CategoricalDtype(list('cab')))
df
df.dtypes
df.B.cat.categories
78 changes: 70 additions & 8 deletions doc/source/categorical.rst
@@ -96,12 +96,19 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
df["B"] = raw_cat
df
You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
Anywhere above that we passed the keyword ``dtype='category'``, we used the default behavior:

1. categories are inferred from the data
2. categories are unordered.

To control those behaviors, instead of passing ``'category'``, use an instance
of :class:`CategoricalDtype`.

.. ipython:: python
s = pd.Series(["a","b","c","a"])
s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
s = pd.Series(["a", "b", "c", "a"])
cat_type = pd.CategoricalDtype(categories=["b", "c", "d"], ordered=False)
s_cat = s.astype(cat_type)
s_cat
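
A short sketch of what the conversion above does to values that are not among
the declared categories. It reuses the names from the example; the `missing`
and `codes` variables are added here purely for illustration:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"])
cat_type = pd.CategoricalDtype(categories=["b", "c", "d"], ordered=False)
s_cat = s.astype(cat_type)

# "a" is not one of the declared categories, so those entries become missing.
missing = s_cat.isna().tolist()

# Missing entries get code -1; the others index into the declared categories.
codes = s_cat.cat.codes.tolist()

print(missing)
print(codes)
```

The unused category ``"d"`` is kept in ``s_cat.cat.categories`` even though it
never appears in the data.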
Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
@@ -140,6 +147,61 @@ constructor to save the factorize step during normal constructor mode:
splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
CategoricalDtype
----------------

.. versionchanged:: 0.21.0

A categorical's type is fully described by 1.) its categories (an iterable with
unique values and no missing values), and 2.) its orderedness (a boolean).
This information can be stored in a :class:`~pandas.CategoricalDtype`.
The ``categories`` argument is optional, which implies that the actual categories
should be inferred from whatever is present in the data when the
:class:`pandas.Categorical` is created.

.. ipython:: python
pd.CategoricalDtype(['a', 'b', 'c'])
pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)
pd.CategoricalDtype()
A :class:`~pandas.CategoricalDtype` can be used in any place pandas expects a
``dtype``, for example :func:`pandas.read_csv`, :func:`pandas.DataFrame.astype`,
or the Series constructor.

As a convenience, you can use the string ``'category'`` in place of a
:class:`pandas.CategoricalDtype` when you want the default behavior of
the categories being unordered, and equal to the set of values present in the array.
In other words, ``dtype='category'`` is equivalent to ``dtype=pd.CategoricalDtype()``.
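
A minimal sketch of that equivalence, assuming the Series constructor accepts
both spellings (the variable names are illustrative):

```python
import pandas as pd

values = ["b", "a", "c", "a"]

# Both spellings ask pandas to infer unordered categories from the data.
by_string = pd.Series(values, dtype="category")
by_dtype = pd.Series(values, dtype=pd.CategoricalDtype())

print(by_string.dtype == by_dtype.dtype)
print(by_dtype.cat.categories)
```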

Equality Semantics
~~~~~~~~~~~~~~~~~~

Two instances of :class:`pandas.CategoricalDtype` compare equal whenever they have
the same categories and orderedness. When comparing two unordered categoricals, the
order of the ``categories`` is not considered.

.. ipython:: python
c1 = pd.CategoricalDtype(['a', 'b', 'c'], ordered=False)
# Equal, since order is not considered when ordered=False
c1 == pd.CategoricalDtype(['b', 'c', 'a'], ordered=False)
# Unequal, since the second CategoricalDtype is ordered
c1 == pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)
All instances of ``CategoricalDtype`` compare equal to the string ``'category'``.

.. ipython:: python
c1 == 'category'
.. warning::

Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
and since all instances of ``CategoricalDtype`` compare equal to ``'category'``,
all instances of ``CategoricalDtype`` compare equal to a ``CategoricalDtype(None)``.
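
The comparisons described in this subsection can be checked with a sketch like
the following. The ordered-versus-ordered case is an addition for illustration,
and the comparison against ``CategoricalDtype(None)`` is left out on purpose:

```python
import pandas as pd

c1 = pd.CategoricalDtype(["a", "b", "c"], ordered=False)

# Unordered dtypes: the order in which categories are listed is ignored.
same_unordered = c1 == pd.CategoricalDtype(["b", "c", "a"], ordered=False)

# Same categories but different orderedness: not equal.
differs_in_flag = c1 == pd.CategoricalDtype(["a", "b", "c"], ordered=True)

# Ordered dtypes: the listing order is part of the type.
ordered_a = pd.CategoricalDtype(["a", "b", "c"], ordered=True)
ordered_b = pd.CategoricalDtype(["b", "c", "a"], ordered=True)
ordered_match = ordered_a == ordered_b

# Any CategoricalDtype compares equal to the string 'category'.
matches_string = c1 == "category"

print(same_unordered, differs_in_flag, ordered_match, matches_string)
```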

Description
-----------

@@ -189,7 +251,7 @@ It's also possible to pass in the categories in a specific order:

.. ipython:: python
s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
s = pd.Series(list('babc')).astype(pd.CategoricalDtype(list('abcd')))
s
# categories
@@ -306,7 +368,7 @@ meaning and certain operations are possible. If the categorical is unordered, ``
s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
s.sort_values(inplace=True)
s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
s = pd.Series(["a","b","c","a"]).astype(pd.CategoricalDtype(ordered=True))
s.sort_values(inplace=True)
s
s.min(), s.max()
@@ -406,9 +468,9 @@ categories or a categorical with any list-like object, will raise a TypeError.

.. ipython:: python
cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
cat = pd.Series([1,2,3]).astype(pd.CategoricalDtype([3, 2, 1], ordered=True))
cat_base = pd.Series([2,2,2]).astype(pd.CategoricalDtype([3, 2, 1], ordered=True))
cat_base2 = pd.Series([2,2,2]).astype(pd.CategoricalDtype(ordered=True))
cat
cat_base
8 changes: 5 additions & 3 deletions doc/source/merging.rst
@@ -831,7 +831,7 @@ The left frame.
.. ipython:: python
X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
X = X.astype('category', categories=['foo', 'bar'])
X = X.astype(pd.CategoricalDtype(categories=['foo', 'bar']))
left = pd.DataFrame({'X': X,
'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
@@ -842,8 +842,10 @@ The right frame.

.. ipython:: python
right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
'Z': [1, 2]})
right = pd.DataFrame({
'X': pd.Series(['foo', 'bar'], dtype=pd.CategoricalDtype(['foo', 'bar'])),
'Z': [1, 2]
})
right
right.dtypes
24 changes: 24 additions & 0 deletions doc/source/whatsnew/v0.21.0.txt
@@ -22,6 +22,8 @@ Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations
New features
~~~~~~~~~~~~

- New user-facing :class:`CategoricalDtype` for specifying categoricals independent
of the data (:issue:`14711`, :issue:`15078`)
- Support for `PEP 519 -- Adding a file system path protocol
<https://www.python.org/dev/peps/pep-0519/>`_ on most readers and writers (:issue:`13823`)
- Added ``__fspath__`` method to :class:`~pandas.HDFStore`, :class:`~pandas.ExcelFile`,
@@ -106,6 +108,28 @@ This does not permit that column to be accessed as an attribute:

Both of these now raise a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.

.. _whatsnew_0210.enhancements.categorical_dtype:

``CategoricalDtype`` for specifying categoricals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`CategoricalDtype` has been added to the public API and expanded to
include the ``categories`` and ``ordered`` attributes. A ``CategoricalDtype``
can be used to specify the set of categories and orderedness of an array,
independent of the data themselves. This can be useful, e.g., when converting
string data to a ``Categorical``:

.. ipython:: python

s = pd.Series(['a', 'b', 'c', 'a']) # strings
dtype = pd.CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
s.astype(dtype)

The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
``Series`` with categorical type will now return an instance of ``CategoricalDtype``.

See :ref:`CategoricalDtype <categorical.categoricaldtype>` for more.
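
As a rough sketch of the ``.dtype`` behavior described above (the ``cat`` and
``idx`` objects are illustrative additions, not part of the original example):

```python
import pandas as pd

dtype = pd.CategoricalDtype(categories=["a", "b", "c", "d"], ordered=True)
s = pd.Series(["a", "b", "c", "a"]).astype(dtype)

cat = pd.Categorical(["a", "b"])
idx = pd.CategoricalIndex(["a", "b"])

# Each container reports its type as a CategoricalDtype instance.
for obj in (s, cat, idx):
    print(type(obj).__name__, isinstance(obj.dtype, pd.CategoricalDtype))

# The Series carries the dtype it was converted with; because the dtype
# is ordered, min/max follow the declared category order.
print(s.dtype == dtype, s.max())
```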

.. _whatsnew_0210.enhancements.other:

Other Enhancements
1 change: 1 addition & 0 deletions pandas/core/api.py
@@ -6,6 +6,7 @@

from pandas.core.algorithms import factorize, unique, value_counts
from pandas.core.dtypes.missing import isna, isnull, notna, notnull
from pandas.core.dtypes.dtypes import CategoricalDtype
from pandas.core.categorical import Categorical
from pandas.core.groupby import Grouper
from pandas.io.formats.format import set_eng_float_format