API/ENH: union Categorical #13361

chris-b1 · 2016-06-04T12:03:11Z

I was looking into #10153 (parsing Categoricals directly) and one thing that seems to be needed
for that is a good way to combine Categoricals. That part alone is complicated enough
so I decided to do a separate PR for it.

This adds a union_categoricals function that takes a list of (identical dtyped, unordered)
and combines them without doing a full factorization again, unioning the categories.

This seems like it might be generically useful, does it belong the public api somewhere and/or
as a method a Categorical? Maybe as on option on concat?

Might also be useful for dask? e.g. dask/dask#1040, cc @mrocklin

An example timing below. Obviously depends on the density of the categories,
but for relatively small numbers of categories, seems to generally be in 5-6x
speed up range vs concatting everything and re-categorizing.

from pandas.types.concat import union_categoricals

group1 = ['aaaaa', 'bbbbb', 'cccccc', 'ddddddd', 'eeeeeeee']
group2 = group1[3:] + ['fffffff', 'gggggggg', 'ddddddddd']

a = np.random.choice(group1, 1000000).astype('object')
b = np.random.choice(group2, 1000000).astype('object')

a_cat, b_cat = pd.Categorical(a), pd.Categorical(b)

In [10]: %timeit np.concatenate([a,b,a,b,a])
10 loops, best of 3: 82.7 ms per loop

In [12]: %timeit pd.Categorical(np.concatenate([a_cat.get_values(),b_cat.get_values(),
                                                a_cat.get_values(),b_cat.get_values(),
                                                a_cat.get_values()]))
1 loops, best of 3: 344 ms per loop

In [14]: %timeit union_categoricals([a_cat,b_cat,a_cat,b_cat,a_cat])
10 loops, best of 3:  40.9 ms per loop

jreback · 2016-06-04T12:12:45Z

see #9927

jreback · 2016-06-04T12:19:44Z

came out of this post: http://stackoverflow.com/questions/29709918/pandas-and-category-replacement

and originally some discussions with @mrocklin

codecov-io · 2016-06-04T12:28:55Z

Current coverage is 84.23%

Merging #13361 into master will increase coverage by <.01%

@@             master     #13361   diff @@
==========================================
  Files           138        138          
  Lines         50724      50742    +18   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42726      42744    +18   
  Misses         7998       7998          
  Partials          0          0

Powered by Codecov. Last updated by 103f7d3...ccaeb76

chris-b1 · 2016-06-04T12:57:27Z

Yeah, I'm definitely guilty of mostly thinking of Categoricals as a "memory efficient string" type.

If I understand your approach in that answer correct, this is what it would look like to concat? It's faster than re-categorizing the whole set of data, but slower than my union_categoricals because (I think) it has to re-hash every element (unless that's possible to avoid?), where I'm reusing the existing integer codes.

In [50]: %%timeit
    ...: 
    ...: from pandas.types.concat import _concat_categorical
    ...: 
    ...: recoded = []
    ...: new_cats = []
    ...: 
    ...: for i, c in enumerate([a_cat,b_cat,a_cat,b_cat,a_cat]):
    ...:     if i == 0:
    ...:         new_cats = c.categories
    ...:     else:
    ...:         new_cats = c.categories.union(new_cats)
    ...: 
    ...: for c in [a_cat,b_cat,a_cat,b_cat,a_cat]:
    ...:     recoded.append(pd.Categorical(c.get_values(), categories=new_cats))
    ...: 
    ...: _concat_categorical(recoded)
1 loops, best of 3: 299 ms per loop

jreback · 2016-06-04T14:01:25Z

Here is a much simpler method along the lines of what I had suggested

In [66]: def f():
   ....:     cats = Index(a_cat.categories.tolist() + b_cat.categories.difference(a_cat.categories).tolist())
   ....:     new_b = b_cat.set_categories(cats)
   ....:     return pd.Categorical.from_codes(np.concatenate([a_cat.codes,new_b.codes]),cats)
   ....: 

In [72]: %timeit pd.Categorical(np.concatenate([a_cat.get_values(),b_cat.get_values()]))
10 loops, best of 3: 110 ms per loop

n [81]: %timeit f()
10 loops, best of 3: 63.4 ms per loop

These are not exactly equal as the categories in the when concatting are in a different order (as they are seen differently; this is the 'expected' result). In fact the 'result' is the correct one here. The categories of all of 1 appear before all of 2.

I am sure this could be optimized further as there are some checks in set_categories that are unecessary for this purpose (that is very generic).

This relies on the fact that you don't need to reset the codes in 1 at all (just extend the categories).

jreback · 2016-06-04T14:18:05Z

In fact I assert that this could be WAY faster all you have to do is this.

we construct a union of the new categories (order actually doesn matter, though it happens to be in order). call this cats (which is what I construct above)
we know that the codes are going to be

In [101]: dict(zip(range(len(cats)),cats))
Out[101]: 
{0: 'aaaaa',
 1: 'bbbbb',
 2: 'cccccc',
 3: 'ddddddd',
 4: 'eeeeeeee',
 5: 'ddddddddd',
 6: 'fffffff',
 7: 'gggggggg'}

Each of the cats that you want to union also has this mapping (but to different codes). All you have to do it map the original code -> new code. Concat then codes and you are done. This is essentially what I am doing above, but It could be a single loop, you can even do it like this.

In [129]: indexer = cats.get_indexer(b_cat.categories)

In [130]: indexer
Out[130]: array([3, 5, 4, 6, 7])

In [131]: new_indexer = np.empty(len(cats),dtype=int)

In [132]: new_indexer.fill(-1)

In [133]: new_indexer[-len(b_cat.categories):] = indexer

In [134]: new_indexer
Out[134]: array([-1, -1, -1,  3,  5,  4,  6,  7])

Now you have a direct map from b_cat codes -> cat codes.

chris-b1 · 2016-06-04T15:33:41Z

Thanks. That's actually almost exactly what I'm doing, but my code is way more complex than it needs to be because I'm directly messing with the hash tables, rather than using the index set ops and get_indexer.

Here's a MUCH simpler impl based on your ideas that gets to similar performance

In [105]: to_union = [a_cat,b_cat,a_cat,b_cat,a_cat]

In [106]: %timeit union_categoricals(to_union)
10 loops, best of 3: 53.5 ms per loop

In [107]: def simple_union_cat(to_union):
     ...:     for i, c in enumerate(to_union):
     ...:         if i == 0:
     ...:             cats = c.categories
     ...:         else:
     ...:             cats = cats.union(c.categories)
     ...:     new_codes = []
     ...:     for c in to_union:
     ...:         indexer = cats.get_indexer(c.categories)
     ...:         new_codes.append(indexer.take(c.codes))
     ...:     codes = np.concatenate(new_codes)
     ...:     return pd.Categorical.from_codes(codes, cats)

In [108]: %timeit simple_union_cat(to_union)
10 loops, best of 3: 73.2 ms per loop

jreback · 2016-06-04T15:35:44Z

@chris-b1 great!

I would move the API to core/categorical.py (or maybe types/concat.py)

jreback · 2016-06-04T15:38:11Z

I think instead of .union on the categories it makes sense to append in order (so make lists out of them as Union sorts)

jreback · 2016-06-04T15:39:25Z

this should raise if any of the cats are ordered I think

jreback · 2016-06-04T16:36:10Z

prob want an asv for this as well.

jreback · 2016-06-05T14:01:31Z

lgtm. pls add a whatsnew entry. I think maybe a tiny doc-mention is categorical.rst would be good as well.

chris-b1 · 2016-06-05T19:00:09Z

@jreback - I updated this with a doc note

jreback · 2016-06-05T19:02:26Z

cc @JanSchulz
@jorisvandenbossche

chris-b1 · 2016-06-06T00:10:12Z

I'm actually pretty far along (well, in terms of it basically working, not necessarily docs/tests/etc) on #10153, so if you'd prefer to review it all at once, I can push those changes onto this branch.

group = ['aaaaaaaa', 'bbbbbbbb', 'cccccc', 'ddddddd', 'eeeeeeee']
df = pd.DataFrame({'a': np.random.choice(group, 10000000).astype('object'),
                   'b': np.random.choice(group, 10000000).astype('object'),
                   'c': np.random.choice(group, 10000000).astype('object')})
df.to_csv('strings.csv', index=False)


In [1]: %timeit df = pd.read_csv('strings.csv').apply(pd.Categorical);
1 loops, best of 3: 6.39 s per loop

In [2]: %timeit pd.read_csv('strings.csv', dtype='category')
1 loops, best of 3: 3.65 s per loop

jreback · 2016-06-06T00:11:33Z

let's do that in a separate branch (u can do it on top of this one)

shoyer · 2016-06-06T03:00:28Z

pandas/types/concat.py

+
+    for i, c in enumerate(to_union):
+        if i == 0:
+            cats = c.categories.tolist()


Use numpy arrays (or pandas indexes) and concatenate them all at once rather than converting things into lists. Something like:

cats = to_union[0].categories appended_cats = cats.append([c.categories for c in to_union[1:]]) unique_cats = appended_cats.unique()

In the worst case, suppose you have n categoricals of fixed size, each of which has all unique values.

My version takes linear time (and space), whereas your version will require quadratic time -- n index/hash table constructions of O(n) elements each, on average.

you soln won't preseve order, though with a small mod I think it could, need to np.concatenate the categories directly (that's why they were listifed in the first place). Index union doesn't preserve order.

though to be honest I can't imagine this is a perf issue. The entire point of this is that categories is << codes. But I suppose if it can be done with no additional complexity then it should.

Thanks @shoyer. I didn't know pd.unique preserves order (numpy sorts), but seems to be the case. Is that an api guarantee?

In [12]: pd.Index(['b','c']).append(pd.Index(['a'])).unique() Out[12]: array(['b', 'c', 'a'], dtype=object)

@chris-b1 yes it IS a guarantee (and that's why its much faster than np.unique)

jankatins · 2016-06-06T14:11:36Z

Seems that people want Categoricals mostly as memory efficient strings :-)/:-( I like that it is not part of the categorical API but a helper function (see #9927 (comment)).

jreback · 2016-06-06T14:23:10Z

yeah I think people are really wanting pd.String, but pd.Categorical is working (mostly)

chris-b1 · 2016-06-07T02:52:58Z

Alright, updated for those comments, let me know if you see anything else. Thanks for fastpath suggestion @JanSchulz, that ends up making a decent performance impact.

jreback · 2016-06-07T12:32:56Z

pandas/types/concat.py

+    Parameters
+    ----------
+    to_union : list like of Categorical
+


add Raises (and list when that happens)

chris-b1 · 2016-06-07T23:21:49Z

@shoyer, @jreback - updated for your comments

jreback · 2016-06-07T23:34:26Z

doc/source/categorical.rst

+Unioning
+~~~~~~~~
+
+If you want to combine categoricals that do not necessarily have


versionadded tag here

jreback · 2016-06-07T23:35:10Z

can u post timings?

chris-b1 · 2016-06-07T23:41:08Z

I also updated my comment at the top. Hard to say if this is "good," but 8x faster than the really naive approach, so that's something.

In [10]: %timeit np.concatenate([a,b,a,b,a])
10 loops, best of 3: 82.7 ms per loop

In [12]: %timeit pd.Categorical(np.concatenate([a_cat.get_values(),b_cat.get_values(),
                                                a_cat.get_values(),b_cat.get_values(),
                                                a_cat.get_values()]))
1 loops, best of 3: 344 ms per loop

In [14]: %timeit union_categoricals([a_cat,b_cat,a_cat,b_cat,a_cat])
10 loops, best of 3:  40.9 ms per loop

shoyer · 2016-06-08T00:12:31Z

pandas/types/concat.py

+    if any(c.ordered for c in to_union):
+        raise TypeError("Can only combine unordered Categoricals")
+
+    first = to_union[0]


You should graceful catch the condition that the list of categoricals is empty.

sinhrks · 2016-06-08T00:20:20Z

xref #10186. We can fix CategoricalIndex.union as the same after this PR.

jreback · 2016-06-08T11:32:50Z

thanks @chris-b1 nice enhancement!

shoyer · 2016-06-09T02:09:45Z

Quick follow-up question: do we really want to encourage users importing this from pandas.types.concat? If we don't want to expose that implementation detail as API, we probably should export it to the main namespace.

max-sixty · 2016-06-09T02:14:09Z

Quick follow-up question: do we really want to encourage users importing this from pandas.types.concat? If we don't want to expose that implementation detail as API, we probably should export it to the main namespace.

👍

Or pd.Categorical.union(*cats) or similar

shoyer · 2016-06-09T02:27:47Z

I like the class method pd.Categorical.union(cats)

jankatins · 2016-06-09T09:11:14Z

Why as a class method? As far as I remember we also don't encourage to use Categorical directly but usually as s.astype("category") -> pd.union_categorical(a, b)? [Edit] ok, this doesn't work because the functions does only take Categorical and not Series of type category[/]

[Edit] Or is that analog to the "constructor" Categorical.from_codes(), like Categorical.from_union(*cats)?

Another thought is that the (public) usecase for this is mainly to use categoricals as a memory efficient string type ("I don't care about the meaning of the categories, just concat them"), so I guess if ever there is a pd.String, this function is then mostly unneeded (the internal usecase in reading in cats directly shouldn't need a public API). From that perspective a pandas.types.concat.union_categorical(a,b) is also fine...

[edit]
My new favorite is Categorical.from_union().

BTW: should that method take only Categoricals or also Series of type cat?

jankatins · 2016-06-09T09:28:42Z

pandas/types/concat.py

@@ -201,6 +201,57 @@ def convert_categorical(x):
        return Categorical(concatted, rawcats)


+def union_categoricals(to_union):


Add a ignore_order=False kwarg? That would prevent changed/copied orig categoricals if you ever need this... Would only disable the order check

if not ignore_order and any(c.ordered for c in to_union): raise TypeError("Can only combine unordered Categoricals")

It would still return a unordered cat, of course.

jreback · 2016-06-09T11:41:19Z

actually, maybe we ought to just hijack pd.concat, this could be a fast-path that just calls union_categoricals if all of the inputs are Categoricals in the first place. From an API perspective it feels right, there is a little bit of added code complexity inside Concat though.

jreback · 2016-06-09T11:42:39Z

so I like @JanSchulz Categorical.from_union() as well

jorisvandenbossche · 2016-06-09T12:22:42Z

At the moment, pd.concat raises if there are different categories:

In [58]: s1 = pd.Series(list('abcab'), dtype='category')

In [59]: s2 = pd.Series(list('cbdcb'), dtype='category')

In [60]: pd.concat([s1, s2])
ValueError: incompatible categories in categorical concat

So that would have to change to put this functionality in pd.concat. @jreback Is that what you were thinking of? Or provide a keyword in concat to ignore incompatible categories? (although I don't think concat should get a keyword specifically for categoricals).
Personally I am not in favor of changing the behaviour of pd.concat.

Also, if we expose this publicly, I think it should handle categorical Series objects as well (as this is what is encouraged to use, not Categoricals directly), as @JanSchulz also noted above.

jreback · 2016-06-09T12:46:46Z

@jorisvandenbossche yes maybe that is too much overloading.

makmanalp · 2017-03-17T00:52:33Z

@jreback @jorisvandenbossche So then would the concat workflow be like:

Load df1 and df2
Union all categoricals from df1 and df2
Run astype(new_categorical) on df1 and df2
Concatenate df1 and df2

?

jreback · 2017-03-17T01:04:23Z

that looks about right
but can u show an example?

API/ENH: union Categorical

ccaeb76

jreback added Enhancement Categorical Categorical Data Type labels Jun 4, 2016

cleanup impl, add asv

7b37c34

jreback added this to the 0.18.2 milestone Jun 5, 2016

doc notes

77e7963

shoyer reviewed Jun 6, 2016
View reviewed changes

move tests, adress feedback

4499cda

jreback changed the title ~~WIP/API/ENH: union Categorical~~ API/ENH: union Categorical Jun 7, 2016

jreback reviewed Jun 7, 2016
View reviewed changes

pandas/types/concat.py

Parameters

----------

to_union : list like of Categorical

Copy link

Contributor

jreback Jun 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add Raises (and list when that happens)

Doc updates; use Index.append

17209f9

jreback reviewed Jun 7, 2016
View reviewed changes

shoyer reviewed Jun 8, 2016
View reviewed changes

versionadded; empty case

568784f

jreback closed this in d5bea25 Jun 8, 2016

jreback mentioned this pull request Jun 8, 2016

The Joy and Agony of Categorical Types dask/dask#1040

Closed

chris-b1 deleted the union-categorical branch June 8, 2016 22:34

jankatins reviewed Jun 9, 2016
View reviewed changes

This was referenced Jun 9, 2016

ENH: Categorical.from_union #13410

Closed

Appending categorical data should be more flexible #12699

Closed

adbull mentioned this pull request Jun 23, 2016

API/ENH: unprotected Categorical #13506

Closed

		@@ -201,6 +201,57 @@ def convert_categorical(x):
		return Categorical(concatted, rawcats)


		def union_categoricals(to_union):

API/ENH: union Categorical #13361

API/ENH: union Categorical #13361

Conversation

chris-b1 commented Jun 4, 2016 • edited Loading

jreback commented Jun 4, 2016

jreback commented Jun 4, 2016

codecov-io commented Jun 4, 2016 • edited Loading

Current coverage is 84.23%

chris-b1 commented Jun 4, 2016 • edited Loading

jreback commented Jun 4, 2016

jreback commented Jun 4, 2016

chris-b1 commented Jun 4, 2016

jreback commented Jun 4, 2016 • edited Loading

jreback commented Jun 4, 2016

jreback commented Jun 4, 2016

jreback commented Jun 4, 2016

jreback commented Jun 5, 2016

chris-b1 commented Jun 5, 2016

jreback commented Jun 5, 2016

chris-b1 commented Jun 6, 2016

jreback commented Jun 6, 2016

shoyer Jun 6, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jankatins commented Jun 6, 2016 • edited Loading

jreback commented Jun 6, 2016

chris-b1 commented Jun 7, 2016

Choose a reason for hiding this comment

chris-b1 commented Jun 7, 2016

Choose a reason for hiding this comment

jreback commented Jun 7, 2016

chris-b1 commented Jun 7, 2016

Choose a reason for hiding this comment

sinhrks commented Jun 8, 2016

jreback commented Jun 8, 2016

shoyer commented Jun 9, 2016

max-sixty commented Jun 9, 2016

shoyer commented Jun 9, 2016

jankatins commented Jun 9, 2016 • edited Loading

Choose a reason for hiding this comment

jreback commented Jun 9, 2016

jreback commented Jun 9, 2016

jorisvandenbossche commented Jun 9, 2016

jreback commented Jun 9, 2016

makmanalp commented Mar 17, 2017

jreback commented Mar 17, 2017

chris-b1 commented Jun 4, 2016 •

edited

Loading

codecov-io commented Jun 4, 2016 •

edited

Loading

chris-b1 commented Jun 4, 2016 •

edited

Loading

jreback commented Jun 4, 2016 •

edited

Loading

shoyer Jun 6, 2016 •

edited

Loading

jankatins commented Jun 6, 2016 •

edited

Loading

jankatins commented Jun 9, 2016 •

edited

Loading