[WIP]API: CategoricalType for specifying categoricals #14698
Simple implementation for now.
class CategoricalType(CategoricalDtype):
    """
this should be in with the rest of the dtypes
to be honest i wouldn't create this; just add it optionally into the existing
Is that OK with the caching and stuff that is done on CategoricalDtype? This is on CategoricalDtype:
    def __new__(cls):
        try:
            return cls._cache[cls.name]
        except KeyError:
            c = object.__new__(cls)
            cls._cache[cls.name] = c
            return c
I haven't messed with extension types much. We could make the keys of that internal dict reflect the categories and ordered attributes.
yes, you would add optional attributes and cache based on them
K, this seems to be working. Just have to convert the categories to a tuple so that they're hashable. Thanks.
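As a rough illustration of the caching approach discussed here, the sketch below keys the instance cache on the categories and ordered attributes. `ParamDtype` is a hypothetical stand-in written for this example, not the pandas class, and the exact shape of the cache key is an assumption.

```python
class ParamDtype:
    """Toy stand-in for a parametrized dtype (hypothetical, not pandas code)."""

    name = "category"
    _cache = {}

    def __new__(cls, categories=None, ordered=False):
        # Lists are unhashable, so normalize any list-like to a tuple
        # before using it as part of the cache key.
        categories_ = None if categories is None else tuple(categories)
        key = (cls.name, categories_, bool(ordered))
        try:
            return cls._cache[key]
        except KeyError:
            obj = object.__new__(cls)
            obj.categories = categories_
            obj.ordered = bool(ordered)
            cls._cache[key] = obj
            return obj


a = ParamDtype(["a", "b"], ordered=True)
b = ParamDtype(("a", "b"), ordered=True)  # same key as `a`
c = ParamDtype(["a", "c"])                # different categories

print(a is b)  # the cached instance is reused for an identical key
print(a is c)  # a different key yields a distinct instance
```

Calling the class with no arguments still returns a single shared instance, which matches the behaviour of the original name-keyed cache.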
What would the ideal

Also, for equality semantics, this is what I have right now:

    An instance of ``CategoricalDtype`` compares equal with any other
    instance of ``CategoricalDtype``, regardless of categories or ordered.
    In addition they compare equal to the string ``'category'``.

    To check whether two instances of a ``CategoricalDtype`` match,
    use the ``is`` operator.

    >>> t1 = CategoricalDtype(['a', 'b'], ordered=True)
    >>> t2 = CategoricalDtype(['a', 'c'], ordered=False)
    >>> t1 == t2
    True
    >>> t1 == 'category'
    True
    >>> t1 is t2
    False
    >>> t1 is CategoricalDtype(['a', 'b'], ordered=True)
    True

though I don't expect people to be working with these objects that much.
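The equality semantics described in that docstring can be sketched in plain Python as follows. `LooseDtype` is a hypothetical class for illustration only; it assumes an instance cache keyed on the categories and ordered attributes, so that `is` distinguishes parametrizations while `==` does not.

```python
class LooseDtype:
    """Hypothetical illustration of the proposed semantics (not pandas code)."""

    name = "category"
    _cache = {}

    def __new__(cls, categories=None, ordered=False):
        key = (None if categories is None else tuple(categories), bool(ordered))
        if key not in cls._cache:
            cls._cache[key] = object.__new__(cls)
        return cls._cache[key]

    def __eq__(self, other):
        # Equal to the string 'category' and to any other instance,
        # regardless of categories or ordered.
        if isinstance(other, str):
            return other == self.name
        return isinstance(other, LooseDtype)

    def __hash__(self):
        # Objects that compare equal must hash equal, so hash on the name only.
        return hash(self.name)


t1 = LooseDtype(["a", "b"], ordered=True)
t2 = LooseDtype(["a", "c"], ordered=False)

print(t1 == t2)          # equality ignores the parameters
print(t1 == "category")  # string alias still matches
print(t1 is t2)          # identity distinguishes the parametrizations
```

One consequence of this design is that `==` cannot be used to test whether two categoricals share the same categories; callers have to rely on identity, which only works as long as every instance goes through the cache.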
    Type for categorical data with the categories and orderedness,
    but not the values

    .. versionadded:: 0.20.0
this has been around for quite some time, you are adding parameter support in 0.20.0
    Parameters
    ----------
    categories : list or None
list-like; use wording similar to what's in Categorical now
    Examples
    --------
    >>> t = CategoricalDtype(categories=['b', 'a'], ordered=True)
these examples are not relevant. This should not include Series. This is a self-contained type.
     name = 'category'
     type = CategoricalDtypeType
     kind = 'O'
     str = '|O08'
     base = np.dtype('O')
     _cache = {}

-    def __new__(cls):
+    def __new__(cls, categories=None, ordered=False):
+        categories_ = categories if categories is None else tuple(categories)
this needs all of the validation logic (from Categorical) on the actual categories.
you need to move over the repr (of the categories) and the validation logic on the categories to this class. Further, Categorical should use this internally for storage of categories / ordered. This is a fairly invasive change and needs some thought.
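To make the validation request concrete, here is a minimal sketch of the kind of check that could run before the categories are stored on the dtype. The helper name `validate_categories` and the two rules it enforces (no null entries, no duplicates) are assumptions made for this example; the actual logic in Categorical may differ in detail.

```python
import math


def validate_categories(categories):
    """Hypothetical helper: return categories as a tuple, or raise ValueError."""
    if categories is None:
        # No categories given; the dtype stays unparametrized.
        return None
    cats = tuple(categories)
    for c in cats:
        # Assumed rule: categories may not contain null values.
        if c is None or (isinstance(c, float) and math.isnan(c)):
            raise ValueError("categories cannot contain null values")
    # Assumed rule: categories must be unique.
    if len(set(cats)) != len(cats):
        raise ValueError("categories must be unique")
    return cats


print(validate_categories(["b", "a"]))
```

Returning a tuple keeps the result hashable, so the same value can be reused directly as part of a cache key.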
I would split off the actual issue change and make that a follow-up PR. Just putting in place the correct infrastructure can be the scope of this PR.
Yep. I had hoped to do the minimal change of just providing a new API, but let's do it right the first time. I'll spend some time on this the next couple weeks and ping when I have something.
Closing for now. Will reopen a new PR on top of a PR fixing #14711
closes #14676
This adds a new top-level class pd.CategoricalType for representing the type information on a Categorical, without the values. Users can pass this anywhere they would previously pass 'category' if they need to specify non-default values for categories or ordered (you can still use 'category' of course):

This is the simplest possible change for now. A bigger change would be to make c.dtype return a CategoricalType instead of 'category'. We could probably do that in a way that's backwards compatible, but I'll need to think on it a bit.

Other places this should work
- select_dtypes (currently returns all categoricals; should just be ones with same CategoricalType?)
- read_csv: need to implement, see API: Expand read_csv dtype for categoricals #14503
- Series(..., dtype=...)

Implementation-wise I need to document, add more tests, and clean up some things like the repr. But I wanted to get this out there for discussion. @JanSchulz you might be interested.