Make categories and ordered part of CategoricalDtype #14711
cc @JanSchulz (let me know if you want me to stop pinging you on these categorical issues 😄). Also @mrocklin I think this may interest you as the fastparquet stuff could use |
Will this also mean that [Thanks for pinging me!] |
Yeah, Dunno if we want to expand the |
Yes, this was the original plan when I first did this, but there were too many moving parts. If this can be implemented in a back-compat manner (this is pretty much an implementation detail), I think it would be great. |
Specific for categorical, I suppose we would have a
Sounds OK
I would not allow this to start with. Is the idea that this way you can let the categories be inferred from the values but set the ordered attribute through the dtype? Or what would be the application?
It seems that we currently replace with NA:
although I also think I would prefer a TypeError as the default behaviour. |
@TomAugspurger How is this issue different from #14676, can that one be closed? |
That one will be a trivial change to .astype once this is fixed. I think leave it open as a reminder to implement it once the CategoricalDtype changes are done.
|
Small update for future reference. I'm having trouble with the current approach: we want CategoricalDtype([1, 2]) to be different from CategoricalDtype([1., 2.]), but if we use a dictionary to do the singleton caching, the keys compare equal, so we only ever get one. @jreback do you have any links to info on what requirements the numpy duck-type has to satisfy? |
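To illustrate the caching problem described above, here is a minimal sketch in plain Python. It is not the pandas implementation: the cache dictionaries, the `get_dtype_naive` and `get_dtype_typed` helpers, and the placeholder "dtype" tuples are all hypothetical. It only shows why integer and float categories collide as dictionary keys, and one possible way to keep them apart.

```python
# Hypothetical singleton cache keyed on the category values alone.
naive_cache = {}


def get_dtype_naive(categories):
    key = tuple(categories)
    # setdefault returns the existing entry when an equal key is already present.
    return naive_cache.setdefault(key, ("dtype-for", key))


ints = get_dtype_naive([1, 2])
floats = get_dtype_naive([1.0, 2.0])
# 1 == 1.0 and hash(1) == hash(1.0), so both calls land on the same cache entry.
same_entry = ints is floats

# One possible workaround (an assumption, not what the thread settled on):
# make the element type part of the key.
typed_cache = {}


def get_dtype_typed(categories):
    key = tuple((type(c).__name__, c) for c in categories)
    return typed_cache.setdefault(key, ("dtype-for", key))
```

With the typed key, `[1, 2]` and `[1.0, 2.0]` get separate entries, while repeated calls with the same categories still return the same object.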
so the cache would need to be changed to something like a
BUT the categories are an Index, which is not hashable. But we could probably do something like:
cc @mikegraham |
I think I may add an option to |
Since an index's Is it the case the Index order matters when the categorical is ordered=True and not when not? |
This is not competing with eq; we generally use .equals to compare equivalence for an Index. Furthermore, this by default DOES consider ordering (a user could control this with index=False) |
@jreback great, thanks. Had to make a minor patch for a parameter that controls whether or not to categorize before hashing. We can't categorize in the CategoricalDtype constructor since we'll hit an infinite loop.

```diff
diff --git a/pandas/tools/hashing.py b/pandas/tools/hashing.py
index 3ed51497d..94739b199 100644
--- a/pandas/tools/hashing.py
+++ b/pandas/tools/hashing.py
@@ -86,7 +86,8 @@ def hash_pandas_object(obj, index=True, encoding='utf8', hash_key=None,
     return h
 
 
-def hash_array(vals, encoding='utf8', hash_key=None, reduce=False):
+def hash_array(vals, encoding='utf8', hash_key=None, reduce=False,
+               categorize_objects=True):
     """
     Given a 1d array, return an array of deterministic integers.
 
@@ -100,6 +101,7 @@ def hash_array(vals, encoding='utf8', hash_key=None, reduce=False):
     hash_key : string key to encode, default to _default_hash_key
     reduce : boolean, default False
         produce a single hash result
+    categorize_objects : bool
 
     Returns
     -------
@@ -137,16 +139,19 @@ def hash_array(vals, encoding='utf8', hash_key=None, reduce=False):
     else:
         # its MUCH faster to categorize object dtypes, then hash and rename
-        codes, categories = factorize(vals, sort=False)
-        categories = Index(categories)
-        c = Series(Categorical(codes, categories,
-                               ordered=False, fastpath=True))
-        vals = _hash.hash_object_array(categories.values,
-                                       hash_key,
-                                       encoding)
-
-        # rename & extract
-        vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
+        if categorize_objects:
+            codes, categories = factorize(vals, sort=False)
+            categories = Index(categories)
+            c = Series(Categorical(codes, categories,
+                                   ordered=False, fastpath=True))
+            vals = _hash.hash_object_array(categories.values,
+                                           hash_key,
+                                           encoding)
+
+            # rename & extract
+            vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
+        else:
+            vals = _hash.hash_object_array(np.asarray(vals), hash_key, encoding)
 
     # Then, redistribute these 64-bit ints within the space of 64-bit ints
     vals ^= vals >> 30
```
|
I must have taken too literally what you meant when you showed an index being used as part of a dict key. |
@TomAugspurger I wouldn't add like that. instead simply call |
@mikegraham yeah this is to 'hash' Index objects, so in theory it would work for this purpose, IOW, providing a data-hash. But I think we would need much more validation / testing if we actually wanted to replace the equiv of |
@jreback I read your code closer and I see what you've done now. I think this code might be a bit unsafe because of hash collisions. If this was a 160+ bit cryptographic hash, I could imagine feeling safe enough, but the hashing here.....I'm really not certain it's the right thing to do. I'd consider keying on something like
Obviously having the key and value have to have references to the index is annoying, but you can get weakref semantics with some work. |
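The collision concern above can be sketched in plain Python. This is only an illustration under stated assumptions, not code from pandas: `VerifiedCache`, `get_or_create`, and the deliberately poor `weak_hash` are hypothetical names. The idea is that the hash only selects a bucket, and an equality check (for an Index that could be `.equals`) decides whether a cached entry really matches.

```python
class VerifiedCache:
    """Cache bucketed by a (possibly colliding) hash, confirmed by equality."""

    def __init__(self, hash_func):
        self._hash = hash_func
        self._buckets = {}  # hash value -> list of (key, value) pairs

    def get_or_create(self, key, factory):
        bucket = self._buckets.setdefault(self._hash(key), [])
        for stored_key, value in bucket:
            if stored_key == key:  # a hash match alone is not trusted
                return value
        value = factory(key)
        bucket.append((key, value))
        return value


def weak_hash(key):
    # Deliberately poor stand-in hash: it only looks at the length.
    return len(key)


cache = VerifiedCache(weak_hash)
a = cache.get_or_create(("x", "y"), lambda k: object())
b = cache.get_or_create(("p", "q"), lambda k: object())  # collides with ("x", "y")
c = cache.get_or_create(("x", "y"), lambda k: object())
```

Here `a` and `b` share a hash but stay distinct entries, while `c` is the same object as `a`. The cost is that the cache holds a reference to every key, which is the annoyance the comment above mentions.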
As predicted by @mikegraham, here is a failure:

In [1]: import numpy as np
In [2]: from pandas.tools.hashing import hash_array
In [3]: L = ['Ingrid-9Z9fKIZmkO7i7Cn51Li34pJm44fgX6DYGBNj3VPlOH50m7HnBlPxfIwFMrc
...: NJNMP6PSgLmwWnInciMWrCSAlLEvt7JkJl4IxiMrVbXSa8ZQoVaq5xoQPjltuJEfwdNlO6jo
...: 8qRRHvD8sBEBMQASrRa6TsdaPTPCBo3nwIBpE7YzzmyH0vMBhjQZLx1aCT7faSEx7PgFxQhH
...: dKFWROcysamgy9iVj8DO2Fmwg1NNl93rIAqC3mdqfrCxrzfvIY8aJdzin2cHVzy3QUJxZgHv
...: tUtOLxoqnUHsYbNTeq0xcLXpTZEZCxD4PGubIuCNf32c33M7HFsnjWSEjE2yVdWKhmSVodyF
...: 8hFYVmhYnMCztQnJrt3O8ZvVRXd5IKwlLexiSp4h888w7SzAIcKgc3g5XQJf6MlSMftDXm9l
...: IsE1mJNiJEv6uY6pgvC3fUPhatlR5JPpVAHNSbSEE73MBzJrhCAbOLXQumyOXigZuPoME7Qg
...: JcBalliQol7YZ9', 'Tim-b9MddTxOWW2AT1Py6vtVbZwGAmYCjbp89p8mxsiFoVX4FyDOF3
...: wFiAkyQTUgwg9sVqVYOZo09Dh1AzhFHbgij52ylF0SEwgzjzHH8TGY8Lypart4p4onnDoDvV
...: MBa0kdthVGKl6K0BDVGzyOXPXKpmnMF1H6rJzqHJ0HywfwS4XYpVwlAkoeNsiicHkJUFdUAh
...: G229INzvIAiJuAHeJDUoyO4DCBqtoZ5TDend6TK7Y914yHlfH3g1WZu5LksKv68VQHJriWFY
...: usW5e6ZZ6dKaMjTwEGuRgdT66iU5nqWTHRH8WSzpXoCFwGcTOwyuqPSe0fTe21DVtJn1FKj9
...: F9nEnR9xOvJUO7E0piCIF4Ad9yAIDY4DBimpsTfKXCu1vdHpKYerzbndfuFe5AhfMduLYZJi
...: 5iAw8qKSwR5h86ttXV0Mc0QmXz8dsRvDgxjXSmupPxBggdlqUlC828hXiTPD7am0yETBV0F3
...: bEtvPiNJfremszcV8NcqAoARMe']
In [4]: hash_array(np.asarray(L, dtype=object), 'utf8')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-5f657fa82496> in <module>()
----> 1 hash_array(np.asarray(L, dtype=object), 'utf8')
/home/mrocklin/workspace/pandas/pandas/tools/hashing.py in hash_array(vals, encoding, hash_key)
127
128 # rename & extract
--> 129 vals = c.cat.rename_categories(Index(vals)).astype(np.uint64).values
130
131 # Then, redistribute these 64-bit ints within the space of 64-bit ints
/home/mrocklin/workspace/pandas/pandas/core/base.py in f(self, *args, **kwargs)
208
209 def f(self, *args, **kwargs):
--> 210 return self._delegate_method(name, *args, **kwargs)
211
212 f.__name__ = name
/home/mrocklin/workspace/pandas/pandas/core/categorical.py in _delegate_method(self, name, *args, **kwargs)
1959 from pandas import Series
1960 method = getattr(self.categorical, name)
-> 1961 res = method(*args, **kwargs)
1962 if res is not None:
1963 return Series(res, index=self.index)
/home/mrocklin/workspace/pandas/pandas/core/categorical.py in rename_categories(self, new_categories, inplace)
756 """
757 cat = self if inplace else self.copy()
--> 758 cat.categories = new_categories
759 if not inplace:
760 return cat
/home/mrocklin/workspace/pandas/pandas/core/categorical.py in _set_categories(self, categories, fastpath)
586 """
587
--> 588 categories = self._validate_categories(categories, fastpath=fastpath)
589 if (not fastpath and self._categories is not None and
590 len(categories) != len(self._categories)):
/home/mrocklin/workspace/pandas/pandas/core/categorical.py in _validate_categories(cls, categories, fastpath)
572
573 if not categories.is_unique:
--> 574 raise ValueError('Categorical categories must be unique')
575
576 return categories
ValueError: Categorical categories must be unique

Both of these inputs hash down to the same value, which causes pandas to correctly raise. |
@mikegraham do you have suggestions on the best course of action here? |
here is the test code. I think this might be some sort of overflow in siphash. |
Is siphash a cryptographic hash? If not then it should not surprise us to see collisions. |
so these are failing if len(string) is an exact power of 2, e.g. these area |
BTW, I had initially done some tests against externally-computed siphash values...we should probably get those into pandas' suite. I can work with you to do that. |
@mikegraham that would be great |
So #14804/#14805 seem to fix the issues. Thanks @mikegraham. As an aside, if you want to do a PR to lock down the actual resultant hash values for a test series, that would be great (so we match external siphash and don't change in the future).
We extended the CategoricalDtype to accept optional categories and ordered arguments.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True)
```

CategoricalDtype is now part of the public API. This allows users to specify the desired categories and orderedness of an operation ahead of time. The current behavior, which is still possible with categories=None, the default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories that are known ahead of time in other places, e.g. .astype, .read_csv, and the Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
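As a rough illustration of what the commit message describes, the sketch below passes a dtype with explicit categories and ordering to `.astype`. It assumes a pandas release that includes this change; the category labels and series values are made up for the example.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# A dtype that carries both the categories and their ordering.
t = CategoricalDtype(categories=["low", "med", "high"], ordered=True)

s = pd.Series(["med", "low", "high", "low"]).astype(t)

# Comparisons follow the declared category order, not lexical order.
mask = s > "low"

# The string alias still works and infers the categories from the values.
inferred = pd.Series(["b", "a", "b"], dtype="category")
```

Because the categories live on the dtype, the same `t` can be reused across chunks or partitions, so every piece ends up with identical categories regardless of which values it happens to contain.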
This is to discuss pushing the `Categorical.categories` and `Categorical.ordered` information into the extension type `CategoricalDtype`.

Note that there is no `values` argument. This is a type constructor that isn't attached to any specific `Categorical` instance.

Why?

Several times now (`read_csv(..., dtype=...)`, `.astype(...)`, `Series([], dtype=...)`) we have places where we accept `dtype='category'`, which takes the values in the method (the series, or column from the CSV, etc.) and hands them off to the value constructor, with no control over the `categories` and `ordered` arguments.

The proposal here would add the `categories` and `ordered` attributes / arguments to `CategoricalDtype` and provide a common API for specifying non-default parameters for the `Categorical` constructor in methods like `read_csv`, `astype`, etc.

We would continue to accept `dtype='category'`.

This becomes even more important when doing operations on larger-than-memory datasets with something like `dask` or even `read_csv(..., chunksize=N)`. Right now you don't have an easy way to specify the `categories` or `ordered` for columns (assuming you know them ahead of time).

Issues

`CategoricalDtype` currently isn't part of the public API. Which methods on it do we make public?

of `CategoricalDtype` should compare equal with all others. You can use identity to check if two types are the same

Should the `categories` argument be required? Currently `dtype='category'` says 1.) infer the categories based on the values, and 2.) it's unordered. Would `CategoricalDtype(None, ordered=False)` be allowed?

What happens? I would probably expect a `TypeError` or `ValueError` as `c` isn't "supposed" to be there. Or do we replace `'c'` with `NA`? Should `strict` be another parameter to `CategoricalDtype` (I don't think so).

I'm willing to work on this over the next couple weeks.
xref #14676 (astype)
xref #14503 (read_csv)
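On the equality and `categories=None` questions raised under Issues, the sketch below shows how pandas releases that shipped the extended `CategoricalDtype` behave, as far as I am aware. The thread itself does not settle these semantics, so treat this as an illustration of the eventual behaviour rather than of the proposal; the category labels are arbitrary.

```python
from pandas.api.types import CategoricalDtype

unordered_xy = CategoricalDtype(["x", "y"])
unordered_yx = CategoricalDtype(["y", "x"])
ordered_xy = CategoricalDtype(["x", "y"], ordered=True)
ordered_yx = CategoricalDtype(["y", "x"], ordered=True)

# Unordered dtypes ignore the order of their categories when compared.
unordered_equal = unordered_xy == unordered_yx

# Ordered dtypes must agree on the category order as well.
ordered_equal = ordered_xy == ordered_yx

# Every instance still compares equal to the string alias.
matches_alias = unordered_xy == "category"

# Omitting the categories is allowed; they are then inferred from the data later.
unspecified = CategoricalDtype()
```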