API/ENH: union Categorical #13361
see #9927 |
came out of this post: http://stackoverflow.com/questions/29709918/pandas-and-category-replacement and originally some discussions with @mrocklin |
Current coverage is 84.23%
@@            master   #13361   diff @@
==========================================
  Files          138      138
  Lines        50724    50742    +18
  Methods          0        0
  Messages         0        0
  Branches         0        0
==========================================
+ Hits         42726    42744    +18
  Misses        7998     7998
  Partials         0        0
Yeah, I'm definitely guilty of mostly thinking of Categoricals as a "memory efficient string" type. If I understand your approach in that answer correctly, this is what it would look like to concat? It's faster than re-categorizing the whole set of data, but slower than my
In [50]: %%timeit
    ...:
    ...: from pandas.types.concat import _concat_categorical
    ...:
    ...: recoded = []
    ...: new_cats = []
    ...:
    ...: for i, c in enumerate([a_cat,b_cat,a_cat,b_cat,a_cat]):
    ...:     if i == 0:
    ...:         new_cats = c.categories
    ...:     else:
    ...:         new_cats = c.categories.union(new_cats)
    ...:
    ...: for c in [a_cat,b_cat,a_cat,b_cat,a_cat]:
    ...:     recoded.append(pd.Categorical(c.get_values(), categories=new_cats))
    ...:
    ...: _concat_categorical(recoded)
1 loops, best of 3: 299 ms per loop
Here is a much simpler method along the lines of what I had suggested
These are not exactly equal as the categories in the when concatting are in a different order (as they are seen differently; this is the 'expected' result). In fact the 'result' is the correct one here. The categories of all of 1 appear before all of 2. I am sure this could be optimized further as there are some checks in This relies on the fact that you don't need to reset the codes in 1 at all (just extend the categories). |
In fact I assert that this could be WAY faster; all you have to do is this. We construct a union of the new categories (order actually doesn't matter, though it happens to be in order). call this
Each of the cats that you want to union also has this mapping (but to different codes). All you have to do is map the original code -> new code. Concat the codes and you are done. This is essentially what I am doing above, but it could be a single loop; you can even do it like this.
Now you have a direct map from b_cat codes -> cat codes. |
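The code remapping described above can be sketched as follows. This is a minimal illustration, not the code from the PR: it assumes two small unordered categoricals with no missing values, the names `a_cat`, `b_cat` and `cats` mirror the thread, and the sample data is made up.

```python
import numpy as np
import pandas as pd

# Sample data (made up for illustration; no missing values).
a_cat = pd.Categorical(["x", "y", "x"])
b_cat = pd.Categorical(["z", "y", "z"])

# Combined categories: a_cat's categories first, then any new ones from b_cat.
cats = a_cat.categories.append(b_cat.categories).unique()

# indexer[i] is the position in `cats` of b_cat's i-th category, i.e. a
# direct map from b_cat codes to codes in the combined categories.
indexer = cats.get_indexer(b_cat.categories)
b_codes = indexer.take(b_cat.codes)

# a_cat's codes need no remapping because its categories come first.
codes = np.concatenate([a_cat.codes, b_codes])
combined = pd.Categorical.from_codes(codes, categories=cats)
```

Only the (small) category index is hashed; the (large) code arrays are just remapped with `take`. Missing values (code -1) would need separate handling, since `take(-1)` picks the last element.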
Thanks. That's actually almost exactly what I'm doing, but my code is way more complex than it needs to be because I'm directly messing with the hash tables, rather than using the index set ops and
Here's a MUCH simpler impl based on your ideas that gets to similar performance:
In [105]: to_union = [a_cat,b_cat,a_cat,b_cat,a_cat]
In [106]: %timeit union_categoricals(to_union)
10 loops, best of 3: 53.5 ms per loop
In [107]: def simple_union_cat(to_union):
     ...:     for i, c in enumerate(to_union):
     ...:         if i == 0:
     ...:             cats = c.categories
     ...:         else:
     ...:             cats = cats.union(c.categories)
     ...:     new_codes = []
     ...:     for c in to_union:
     ...:         indexer = cats.get_indexer(c.categories)
     ...:         new_codes.append(indexer.take(c.codes))
     ...:     codes = np.concatenate(new_codes)
     ...:     return pd.Categorical.from_codes(codes, cats)
In [108]: %timeit simple_union_cat(to_union)
10 loops, best of 3: 73.2 ms per loop
@chris-b1 great! I would move the API to |
I think instead of .union on the categories it makes sense to append in order (so make lists out of them as Union sorts) |
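A small sketch of the difference being pointed out here, using made-up index values: `Index.union` sorts comparable values, whereas appending and de-duplicating keeps first-seen order.

```python
import pandas as pd

first = pd.Index(["b", "c"])
second = pd.Index(["a", "c"])

# Index.union sorts the result when the values are comparable.
unioned = first.union(second)

# Appending and then de-duplicating keeps the order of first appearance.
appended = first.append(second).unique()
```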
this should raise if any of the cats are ordered I think |
prob want an asv for this as well. |
lgtm. pls add a whatsnew entry. I think maybe a tiny doc-mention is |
@jreback - I updated this with a doc note |
I'm actually pretty far along (well, in terms of it basically working, not necessarily docs/tests/etc) on #10153, so if you'd prefer to review it all at once, I can push those changes onto this branch.
group = ['aaaaaaaa', 'bbbbbbbb', 'cccccc', 'ddddddd', 'eeeeeeee']
df = pd.DataFrame({'a': np.random.choice(group, 10000000).astype('object'),
                   'b': np.random.choice(group, 10000000).astype('object'),
                   'c': np.random.choice(group, 10000000).astype('object')})
df.to_csv('strings.csv', index=False)
In [1]: %timeit df = pd.read_csv('strings.csv').apply(pd.Categorical);
1 loops, best of 3: 6.39 s per loop
In [2]: %timeit pd.read_csv('strings.csv', dtype='category')
1 loops, best of 3: 3.65 s per loop
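For readers who want to try the two routes on a small input, here is a hedged sketch. It assumes a pandas release in which `read_csv` accepts `dtype='category'` (the feature #10153 tracks), uses an in-memory CSV with made-up contents, and swaps in `astype('category')` for the `apply(pd.Categorical)` call timed above.

```python
import io

import pandas as pd

csv_text = "a,b\nx,p\ny,p\nx,q\n"  # made-up sample data

# Parse straight into categorical columns.
direct = pd.read_csv(io.StringIO(csv_text), dtype="category")

# Parse as plain strings first, then convert every column.
converted = pd.read_csv(io.StringIO(csv_text)).astype("category")
```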
let's do that in a separate branch (u can do it on top of this one) |
for i, c in enumerate(to_union):
    if i == 0:
        cats = c.categories.tolist()
Use numpy arrays (or pandas indexes) and concatenate them all at once rather than converting things into lists. Something like:
    cats = to_union[0].categories
    appended_cats = cats.append([c.categories for c in to_union[1:]])
    unique_cats = appended_cats.unique()
In the worst case, suppose you have n categoricals of fixed size, each of which has all unique values. My version takes linear time (and space), whereas your version will require quadratic time: n index/hash table constructions of O(n) elements each, on average.
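The complexity argument can be illustrated in plain Python. This is only a sketch of the idea, not pandas code: `ordered_union` and `pairwise_union` are hypothetical helpers, and lists of strings stand in for category indexes.

```python
from itertools import chain


def ordered_union(category_lists):
    """Order-preserving union built in a single pass over all categories."""
    # dict keys keep insertion order (Python 3.7+), so only one hash table
    # is built in total, giving linear time in the number of categories.
    return list(dict.fromkeys(chain.from_iterable(category_lists)))


def pairwise_union(category_lists):
    """Accumulating variant that rebuilds a lookup table for every input."""
    result = []
    for cats in category_lists:
        seen = set(result)  # a fresh hash table of everything seen so far
        result.extend(c for c in cats if c not in seen)
    return result


example = [["b", "c"], ["a", "b"], ["d"]]
single_pass = ordered_union(example)
accumulated = pairwise_union(example)
```

Both return the same categories in first-seen order; the second rebuilds a growing table once per input, which is where the quadratic worst case comes from.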
Your solution won't preserve order, though with a small mod I think it could: you need to np.concatenate the categories directly (that's why they were listified in the first place). Index union doesn't preserve order.
though to be honest I can't imagine this is a perf issue. The entire point of this is that categories is << codes. But I suppose if it can be done with no additional complexity then it should.
Thanks @shoyer. I didn't know pd.unique preserves order (numpy sorts), but that seems to be the case. Is that an API guarantee?
In [12]: pd.Index(['b','c']).append(pd.Index(['a'])).unique()
Out[12]: array(['b', 'c', 'a'], dtype=object)
@chris-b1 yes it IS a guarantee (and that's why it's much faster than np.unique)
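A quick way to see the ordering difference discussed here, using a made-up object array:

```python
import numpy as np
import pandas as pd

values = np.array(["b", "c", "a", "b"], dtype=object)

# pd.unique is hash-based and returns values in order of first appearance.
in_order = pd.unique(values)

# np.unique is sort-based and returns the unique values sorted.
sorted_vals = np.unique(values)
```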
Seems that people want Categoricals mostly as memory efficient strings :-)/:-( I like that it is not part of the categorical API but a helper function (see #9927 (comment)). |
yeah I think people are really wanting |
Alright, updated for those comments, let me know if you see anything else. Thanks for fastpath suggestion @JanSchulz, that ends up making a decent performance impact. |
Parameters
----------
to_union : list like of Categorical
add Raises (and list when that happens)
Unioning
~~~~~~~~

If you want to combine categoricals that do not necessarily have
versionadded tag here
can u post timings? |
I also updated my comment at the top. Hard to say if this is "good," but 8x faster than the really naive approach, so that's something.
In [10]: %timeit np.concatenate([a,b,a,b,a])
10 loops, best of 3: 82.7 ms per loop
In [12]: %timeit pd.Categorical(np.concatenate([a_cat.get_values(),b_cat.get_values(),
                                                a_cat.get_values(),b_cat.get_values(),
                                                a_cat.get_values()]))
1 loops, best of 3: 344 ms per loop
In [14]: %timeit union_categoricals([a_cat,b_cat,a_cat,b_cat,a_cat])
10 loops, best of 3: 40.9 ms per loop
if any(c.ordered for c in to_union):
    raise TypeError("Can only combine unordered Categoricals")

first = to_union[0]
You should gracefully catch the condition that the list of categoricals is empty.
xref #10186. We can fix |
thanks @chris-b1 nice enhancement! |
Quick follow-up question: do we really want to encourage users importing this from |
👍 Or |
I like the class method |
Why as a class method? As far as I remember we also don't encourage to use [Edit] Or is that analog to the "constructor" Another thought is that the (public) usecase for this is mainly to use categoricals as a memory efficient string type ("I don't care about the meaning of the categories, just concat them"), so I guess if ever there is a [edit] BTW: should that method take only |
@@ -201,6 +201,57 @@ def convert_categorical(x):
    return Categorical(concatted, rawcats)


def union_categoricals(to_union):
Add an ignore_order=False kwarg? That would prevent changed/copied orig categoricals if you ever need this... It would only disable the order check:
    if not ignore_order and any(c.ordered for c in to_union):
        raise TypeError("Can only combine unordered Categoricals")
It would still return an unordered cat, of course.
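A minimal sketch of how the proposed keyword would behave. `check_can_union` is a hypothetical helper written for illustration; it only wraps the check quoted above and is not the function under review.

```python
import pandas as pd


def check_can_union(to_union, ignore_order=False):
    """Hypothetical helper mirroring the order check proposed above."""
    if not ignore_order and any(c.ordered for c in to_union):
        raise TypeError("Can only combine unordered Categoricals")


unordered = pd.Categorical(["a", "b"])
ordered = pd.Categorical(["a", "b"], ordered=True)

check_can_union([unordered, unordered])                   # passes silently
check_can_union([unordered, ordered], ignore_order=True)  # check is skipped

try:
    check_can_union([unordered, ordered])
    raised = False
except TypeError:
    raised = True
```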
actually, maybe we ought to just hijack |
so I like @JanSchulz |
At the moment,
So that would have to change to put this functionality in Also, if we expose this publicly, I think it should handle categorical Series objects as well (as this is what is encouraged to use, not Categoricals directly), as @JanSchulz also noted above. |
@jorisvandenbossche yes maybe that is too much overloading. |
@jreback @jorisvandenbossche So then would the concat workflow be like:
? |
that looks about right |
I was looking into #10153 (parsing Categoricals directly) and one thing that seems to be needed for that is a good way to combine Categoricals. That part alone is complicated enough that I decided to do a separate PR for it.
This adds a union_categoricals function that takes a list of (identical dtyped, unordered) Categoricals and combines them without doing a full factorization again, unioning the categories.
This seems like it might be generically useful. Does it belong in the public API somewhere and/or as a method on Categorical? Maybe as an option on concat?
Might also be useful for dask? e.g. dask/dask#1040, cc @mrocklin
An example timing below. Obviously it depends on the density of the categories, but for relatively small numbers of categories, the speedup generally seems to be in the 5-6x range vs concatting everything and re-categorizing.
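A short usage sketch, assuming a later pandas release where the function is importable from `pandas.api.types` (an import path that does not appear in this thread, which places it in `pandas.types.concat`). The sample data is made up.

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(["b", "c"])
b = pd.Categorical(["a", "b"])

# Union the categories and remap the codes without re-factorizing the values.
combined = union_categoricals([a, b])

# The naive alternative: concatenate the raw values and re-categorize.
recoded = pd.Categorical(list(a) + list(b))
```

Both hold the same values; they differ in category order, since the unioned result keeps categories in order of appearance while re-categorizing sorts them.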