API/ENH: union Categorical #13361
Changes from all commits
ccaeb76
7b37c34
77e7963
4499cda
17209f9
568784f
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -201,6 +201,57 @@ def convert_categorical(x): | |
return Categorical(concatted, rawcats) | ||
|
||
|
||
def union_categoricals(to_union): | ||
Review comment: Add a
if not ignore_order and any(c.ordered for c in to_union):
    raise TypeError("Can only combine unordered Categoricals")
It would still return an unordered cat, of course.
||
""" | ||
Review comment: add a versionadded tag
||
Combine list-like of Categoricals, unioning categories. All | ||
must have the same dtype, and none can be ordered. | ||
|
||
.. versionadded:: 0.18.2 | ||
|
||
Parameters | ||
---------- | ||
to_union : list-like of Categoricals | ||
|
||
Review comment: add Raises (and list when that happens)
||
Returns | ||
------- | ||
Categorical | ||
A single array, categories will be ordered as they | ||
appear in the list | ||
|
||
Raises | ||
------ | ||
TypeError | ||
If any of the categoricals are ordered or they do not | ||
all have the same dtype | ||
ValueError | ||
Empty list of categoricals passed | ||
""" | ||
from pandas import Index, Categorical | ||
|
||
if len(to_union) == 0: | ||
raise ValueError('No Categoricals to union') | ||
|
||
first = to_union[0] | ||
if any(c.ordered for c in to_union): | ||
raise TypeError("Can only combine unordered Categoricals") | ||
|
||
if not all(com.is_dtype_equal(c.categories.dtype, first.categories.dtype) | ||
for c in to_union): | ||
raise TypeError("dtype of categories must be the same") | ||
|
||
cats = first.categories | ||
unique_cats = cats.append([c.categories for c in to_union[1:]]).unique() | ||
Review comment: But if the tests pass, I am clearly confused.
Review comment: The tests will (should!) pass, because I'm wrapping back in an index, and the non-numpy index types have
Review comment: you should just and use pd.unique directly. I think append is meant for 2 indexes only
Review comment: Actually
In [74]: %timeit Index(a).append(Index(b)).drop_duplicates()
10 loops, best of 3: 197 ms per loop
In [75]: %timeit Index(Index(a).append(Index(b)).unique())
10 loops, best of 3: 71.5 ms per loop
Review comment: Just to be clear it's a bit messier than that,
Review comment: I think there is an issue for that - if not pls create one
Review comment: Oops -- I just opened a new one: #13395
Review comment: ok close that new one - I think there is one about series as well - this is an old compat thing
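For reference, the two expressions timed in the thread can be checked for equivalent results with a small sketch. The inputs `a` and `b` are not shown in the thread, so the arrays below are made-up stand-ins, and no timing claim is made here.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: the `a` and `b` timed in the thread are not shown.
a = np.array(["x", "y", "z"], dtype=object)
b = np.array(["z", "w", "x"], dtype=object)

combined = pd.Index(a).append(pd.Index(b))

# Variant from In [74]: drop duplicates on the appended Index.
via_drop = combined.drop_duplicates()

# Variant from In [75]: take unique values, then wrap back in an Index.
via_unique = pd.Index(combined.unique())

# Both keep the first occurrence of each value, in order of appearance.
assert via_drop.equals(via_unique)
```

Both variants keep values in order of first appearance, which is the property the category union relies on.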
||
categories = Index(unique_cats) | ||
|
||
new_codes = [] | ||
for c in to_union: | ||
indexer = categories.get_indexer(c.categories) | ||
new_codes.append(indexer.take(c.codes)) | ||
codes = np.concatenate(new_codes) | ||
return Categorical(codes, categories=categories, ordered=False, | ||
fastpath=True) | ||
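The remapping step above (build the combined categories, then translate each input's codes through `get_indexer`) can be tried in isolation with a rough sketch. The function name `union_categoricals_sketch` is hypothetical. The sketch makes two assumptions that are not in the diff: it builds the result with `Categorical.from_codes` rather than the `fastpath=True` constructor argument, which may not exist in the pandas version you run, and it keeps `-1` codes (missing values) as `-1`.

```python
import numpy as np
import pandas as pd


def union_categoricals_sketch(to_union):
    """Hypothetical stand-alone version of the steps shown in the diff."""
    if len(to_union) == 0:
        raise ValueError("No Categoricals to union")
    if any(c.ordered for c in to_union):
        raise TypeError("Can only combine unordered Categoricals")

    first = to_union[0]
    if not all(c.categories.dtype == first.categories.dtype for c in to_union):
        raise TypeError("dtype of categories must be the same")

    # Combined categories, in order of first appearance.
    combined = first.categories
    for c in to_union[1:]:
        combined = combined.append(c.categories)
    categories = pd.Index(combined.unique())

    new_codes = []
    for c in to_union:
        # Position of each of c's own categories in the combined index.
        indexer = categories.get_indexer(c.categories)
        remapped = indexer.take(c.codes)
        # Assumption (not in the diff): keep -1 codes, i.e. missing values.
        new_codes.append(np.where(c.codes == -1, -1, remapped))

    codes = np.concatenate(new_codes)
    return pd.Categorical.from_codes(codes, categories=categories)


left = pd.Categorical(["b", "c"])
right = pd.Categorical(["a", "b"])
result = union_categoricals_sketch([left, right])
```

Here `result` holds the values of `left` followed by those of `right`, and its categories follow the order in which they first appear across the inputs.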
|
||
|
||
def _concat_datetime(to_concat, axis=0, typs=None): | ||
""" | ||
provide concatenation of an datetimelike array of arrays each of which is a | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -963,14 +963,40 @@ def assertNotIsInstance(obj, cls, msg=''): | |
|
||
|
||
def assert_categorical_equal(left, right, check_dtype=True, | ||
- obj='Categorical'): | ||
obj='Categorical', check_category_order=True): | ||
"""Test that categoricals are equivalent | ||
|
||
Parameters | ||
---------- | ||
left, right : Categorical | ||
Categoricals to compare | ||
check_dtype : bool, default True | ||
Check that integer dtype of the codes are the same | ||
obj : str, default 'Categorical' | ||
Specify object name being compared, internally used to show appropriate | ||
assertion message | ||
check_category_order : bool, default True | ||
Whether the order of the categories should be compared, which | ||
implies identical integer codes. If False, only the resulting | ||
values are compared. The ordered attribute is | ||
checked regardless. | ||
""" | ||
assertIsInstance(left, pd.Categorical, '[Categorical] ') | ||
Review comment: can you add a doc-string
||
assertIsInstance(right, pd.Categorical, '[Categorical] ') | ||
|
||
- assert_index_equal(left.categories, right.categories, | ||
- obj='{0}.categories'.format(obj)) | ||
- assert_numpy_array_equal(left.codes, right.codes, check_dtype=check_dtype, | ||
- obj='{0}.codes'.format(obj)) | ||
if check_category_order: | ||
assert_index_equal(left.categories, right.categories, | ||
obj='{0}.categories'.format(obj)) | ||
assert_numpy_array_equal(left.codes, right.codes, | ||
check_dtype=check_dtype, | ||
obj='{0}.codes'.format(obj)) | ||
else: | ||
assert_index_equal(left.categories.sort_values(), | ||
right.categories.sort_values(), | ||
obj='{0}.categories'.format(obj)) | ||
assert_index_equal(left.categories.take(left.codes), | ||
right.categories.take(right.codes), | ||
obj='{0}.values'.format(obj)) | ||
|
||
assert_attr_equal('ordered', left, right, obj=obj) | ||
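The `check_category_order=False` branch can be illustrated with a minimal sketch built only on the public `pandas.testing.assert_index_equal`. The helper name `assert_values_equal_any_category_order` is hypothetical, and the sketch assumes neither categorical contains missing values, since `take` on a `-1` code would pick the last category.

```python
import pandas as pd


def assert_values_equal_any_category_order(left, right):
    """Hypothetical helper mirroring the check_category_order=False branch."""
    # The same set of categories must be present, in any order.
    pd.testing.assert_index_equal(left.categories.sort_values(),
                                  right.categories.sort_values())
    # The decoded values must match position by position.
    pd.testing.assert_index_equal(left.categories.take(left.codes),
                                  right.categories.take(right.codes))
    # The ordered attribute is checked regardless.
    assert left.ordered == right.ordered


left = pd.Categorical(["a", "b", "a"], categories=["a", "b"])
right = pd.Categorical(["a", "b", "a"], categories=["b", "a"])

# Passes: the integer codes differ, but the values are the same.
assert_values_equal_any_category_order(left, right)
```

The two inputs have different codes (`[0, 1, 0]` versus `[1, 0, 1]`), so a strict comparison of categories and codes would fail even though the values agree.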
|
||
|
Review comment: versionadded tag here