[WIP]API: CategoricalType for specifying categoricals #14698

TomAugspurger · 2016-11-20T02:47:15Z

This adds a new top-level class pd.CategoricalType for representing the type information on a Categorical, without the values. Users can pass this anywhere they would previously pass 'category' if they need to specify non-default values for categories or ordered (you can still use 'category' of course):

In [1]: from pandas import *

In [2]: paste
    >>> t = CategoricalType(categories=['b', 'a'], ordered=True)
    >>> s = Series(['a', 'a', 'b', 'b', 'a'])
    >>> c = s.astype(t)
    >>> c

## -- End pasted text --
Out[2]:
0    a
1    a
2    b
3    b
4    a
dtype: category
Categories (2, object): [b < a]

This is the simplest possible change for now. A bigger change would be to make c.dtype return a CategoricalType instead of 'category'. We could probably do that in a way that's backwards compatible, but I'll need to think on it a bit.

Other places this should work:

- select_dtypes (currently returns all categoricals; should it return just the ones with the same CategoricalType?)
- read_csv: need to implement, see API: Expand read_csv dtype for categoricals #14503
- Series(..., dtype=...)
- probably others.

Implementation-wise, I still need to write documentation, add more tests, and clean up a few things such as the repr. But I wanted to get this out there for discussion. @JanSchulz you might be interested.

Simple implementation for now.

jreback · 2016-11-20T02:52:07Z

pandas/core/categorical.py

+
+
+class CategoricalType(CategoricalDtype):
+    """


this should be in with the rest of the dtypes

to be honest i wouldn't create this; just add it optionally into the existing

Is that OK with the caching and stuff that is done on CategoricalDtype? This is on CategoricalDtype:

    def __new__(cls):
        try:
            return cls._cache[cls.name]
        except KeyError:
            c = object.__new__(cls)
            cls._cache[cls.name] = c
            return c

I haven't messed with extension types much. We could make the keys of that internal dict reflect the categories and ordered attributes.

yes, you would add optional attributes and cache based on them

K, this seems to be working. Just have to convert the categories to a tuple so that they're hashable. Thanks.
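
A minimal sketch of the caching idea discussed in this thread, not the code from this PR: the cache key is built from the categories (converted to a tuple so they are hashable) and the ordered flag. The class name `ParametrizedDtype` and its attribute layout are assumptions made for illustration only.

```python
class ParametrizedDtype:
    """Hypothetical stand-in for a dtype that is cached on its parameters."""

    name = 'category'
    _cache = {}

    def __new__(cls, categories=None, ordered=False):
        # A list is unhashable, so convert categories to a tuple for the key.
        categories_ = categories if categories is None else tuple(categories)
        key = (categories_, ordered)
        try:
            return cls._cache[key]
        except KeyError:
            obj = object.__new__(cls)
            obj.categories = categories_
            obj.ordered = ordered
            cls._cache[key] = obj
            return obj


a = ParametrizedDtype(['b', 'a'], ordered=True)
b = ParametrizedDtype(['b', 'a'], ordered=True)
c = ParametrizedDtype(['b', 'a'], ordered=False)
print(a is b)  # same parameters, so the cached instance is reused
print(a is c)  # different ordered flag, so a separate instance is created
```

With this keying, calling the class with no arguments still returns a single shared instance, which matches the behaviour of the original `__new__` quoted above.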

TomAugspurger · 2016-11-20T12:47:37Z

What would the ideal __repr__ of CategoricalDtype be? Currently it's just category. For places like DataFrame.info it's probably best to keep that as is, but that's not as useful if you're inspecting a CategoricalDtype itself.

Also, for equality semantics, this is what I have right now

    An instance of ``CategoricalDtype`` compares equal with any other
    instance of ``CategoricalDtype``, regardless of categories or ordered.
    In addition they compare equal to the string ``'category'``.
    To check whether two instances of a ``CategoricalDtype`` match,
    use the ``is`` operator.

    >>> t1 = CategoricalDtype(['a', 'b'], ordered=True)
    >>> t2 = CategoricalDtype(['a', 'c'], ordered=False)
    >>> t1 == t2
    True
    >>> t1 == 'category'
    True
    >>> t1 is t2
    False
    >>> t1 is CategoricalDtype(['a', 'b'], ordered=True)
    True

though I don't expect people to be working with these objects that much.
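
The semantics described in that docstring can be reproduced with a small, pandas-free sketch. `DtypeSketch` is a hypothetical class written only to illustrate the behaviour; it is not the implementation in this PR. It assumes instances are cached on `(categories, ordered)`, so `is` distinguishes parametrizations while `==` does not.

```python
class DtypeSketch:
    """Hypothetical dtype: loose equality, identity per parametrization."""

    name = 'category'
    _cache = {}

    def __new__(cls, categories=None, ordered=False):
        key = (None if categories is None else tuple(categories), ordered)
        if key not in cls._cache:
            obj = object.__new__(cls)
            obj.categories, obj.ordered = key
            cls._cache[key] = obj
        return cls._cache[key]

    def __eq__(self, other):
        # Equal to the string 'category' and to any other instance,
        # regardless of categories or ordered.
        if isinstance(other, str):
            return other == self.name
        return isinstance(other, DtypeSketch)

    def __hash__(self):
        # Defining __eq__ removes the default hash; keep it consistent
        # with the loose equality above.
        return hash(self.name)


t1 = DtypeSketch(['a', 'b'], ordered=True)
t2 = DtypeSketch(['a', 'c'], ordered=False)
print(t1 == t2)          # loose equality ignores the parameters
print(t1 == 'category')  # string comparison still works
print(t1 is t2)          # identity reflects the parameters
print(t1 is DtypeSketch(['a', 'b'], ordered=True))
```

One consequence of this design is that code comparing dtypes with `==` cannot tell two differently parametrized categoricals apart; only the identity check can.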

jreback · 2016-11-21T11:29:40Z

pandas/types/dtypes.py

+    Type for categorical data with the categories and orderedness,
+    but not the values
+
+    .. versionadded:: 0.20.0


this has been around for quite some time, you are adding parameter support in 0.20.0

jreback · 2016-11-21T11:30:40Z

pandas/types/dtypes.py

+
+    Parameters
+    ----------
+    categories : list or None


list-like; use wording similar to what's in Categorical now

jreback · 2016-11-21T11:31:11Z

pandas/types/dtypes.py

+
+    Examples
+    --------
+    >>> t = CategoricalDtype(categories=['b', 'a'], ordered=True)


these examples are not relevant. This should not include Series. This is a self-contained type.

jreback · 2016-11-21T11:32:11Z

pandas/types/dtypes.py

    name = 'category'
    type = CategoricalDtypeType
    kind = 'O'
    str = '|O08'
    base = np.dtype('O')
    _cache = {}

-    def __new__(cls):
+    def __new__(cls, categories=None, ordered=False):
+        categories_ = categories if categories is None else tuple(categories)


this needs all of the validation logic (from Categorical). on the actual categories.
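
As a rough illustration of what validation on the categories could look like, here is a standalone sketch. The helper name `validate_categories`, the specific checks (no null entries, no duplicates), and the error messages are assumptions for illustration; the real checks live in Categorical and may differ.

```python
import math


def validate_categories(categories):
    """Hypothetical helper: return categories as a tuple or raise ValueError."""
    if categories is None:
        return None
    cats = tuple(categories)
    for cat in cats:
        # Treat None and float NaN as null entries.
        if cat is None or (isinstance(cat, float) and math.isnan(cat)):
            raise ValueError("categories cannot be null")
    # Assumes every category is hashable.
    if len(set(cats)) != len(cats):
        raise ValueError("categories must be unique")
    return cats


print(validate_categories(['b', 'a']))
```

Running the validation before the cache key is built would mean invalid parametrizations never reach the cache.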

jreback · 2016-11-21T11:34:27Z

you need to move over the repr (of the categories) and the validation logic on the categories to this class.

Further Categorical should use this internally for storage of categories / ordered.

This is a fairly invasive change and needs some thought.

jreback · 2016-11-21T11:35:16Z

I would split off the actual issue change and make that a follow-up PR. Just putting in place the correct infrastructure can be the scope of this PR.

TomAugspurger · 2016-11-21T13:06:12Z

> This is a fairly invasive change and needs some thought.

Yep. I had hoped to do the minimal change of just providing a new API, but let's do it right the first time. I'll spend some time on this the next couple weeks and ping when I have something.

TomAugspurger · 2016-11-22T14:10:02Z

Closing for now. Will reopen a new PR on top of a PR fixing #14711

API: CategoricalType for specifying categoricals

9b2d05f

Simple implementation for now.

TomAugspurger added the API Design, Categorical (Categorical Data Type), and Dtype Conversions (Unexpected or buggy dtype conversions) labels Nov 20, 2016

TomAugspurger added this to the 0.20.0 milestone Nov 20, 2016

jreback requested changes Nov 20, 2016


TomAugspurger added 2 commits November 20, 2016 06:52

reuse CategoricalDtye

9777fcf

Series ctor

2a6a0e1

jreback reviewed Nov 21, 2016


TomAugspurger closed this Nov 22, 2016
