Concat on non-identical categories in categorical indexes raises TypeError #17629

P-Tillmann · 2017-09-22T14:01:27Z

Code Sample, a copy-pastable example if possible

import pandas as pd
a = pd.DataFrame({'one':['foo','bar','bar'],'two':[1,2,3],'three':[1,3,4]})
b = pd.DataFrame({'one':['baz','bar','plob'],'two':[2,2,1],'three':[5,3,1]})

a['one'] = a['one'].astype('category')
b['one'] = b['one'].astype('category')
c = a.set_index('one')
d = b.set_index('one')
print(pd.concat([a, b]))
    one  three  two
0   foo      1    1
1   bar      3    2
2   bar      4    3
0   baz      5    2
1   bar      3    2
2  plob      1    1
print(pd.concat([c, d]))
raise TypeError("categories must match existing categories "
TypeError: categories must match existing categories when appending

Problem description

While concatenating categorical data with non-identical categories is now supported for data columns (the data is upcast to an appropriate dtype), attempting the same thing with categorical indexes still raises an error.
I think it would be convenient for users if indexes behaved the same way as data columns.
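
For comparison, here is a minimal sketch of the column-side behaviour described above, using two standalone categorical Series instead of the frames from the code sample. The names `left`, `right` and `combined` are only illustrative, and the exact fallback dtype may differ between pandas versions, so the check only confirms that the result is no longer categorical.

```python
import pandas as pd

# Two categorical columns whose category sets differ.
left = pd.Series(["foo", "bar", "bar"], dtype="category")
right = pd.Series(["baz", "bar", "plob"], dtype="category")

# The categories do not match, so the result cannot stay categorical
# and is upcast to a non-categorical dtype instead of raising.
combined = pd.concat([left, right], ignore_index=True)

is_categorical = isinstance(combined.dtype, pd.CategoricalDtype)
print(combined.tolist())
print(is_categorical)
```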

Expected Output

print(pd.concat([c, d]))
      three  two
one             
foo       1    1
bar       3    2
bar       4    3
baz       5    2
bar       3    2
plob      1    1
print(pd.concat([c, d]).index.dtype)
dtype('O')

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_GB.UTF-8
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.20.3
pytest: 3.2.1
pip: None
setuptools: None
Cython: 0.26
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None


jreback · 2017-09-22T15:20:00Z

hmm that seems like a bug, it should work the same. can you submit a PR?

jreback · 2017-09-22T15:20:17Z

cc @chris-b1 @TomAugspurger

TomAugspurger · 2017-09-22T15:23:43Z

Related to #14177 (that's about unioning the columns).

We should do this for both at the same time.

jreback · 2017-09-22T15:25:19Z

right but we are now correctly upcasting to object dtype for the columns. This should still work on the indexes now.

#14177 is about providing an option for this (rather than just doing the upcast).

P-Tillmann · 2017-09-22T15:46:32Z

I'll take a deeper look and see if I can put something together to fix this, and report back if I run into problems.

postelrich · 2018-02-27T20:00:23Z

Was this fixed? Ran into the same error using dd.concat in dask with pandas 0.22.0.

cc @TomAugspurger

TomAugspurger · 2018-02-27T20:14:37Z

Still an issue in pandas.

@postelrich was this a recent version of dask? dask/dask#2963 fixed an issue with dd.concat, but dask differs from pandas by unioning the categories before concatenating.
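
To illustrate what "unioning the categories before concatenating" can look like in plain pandas, here is a rough sketch using `pandas.api.types.union_categoricals`. This is not dask's actual implementation, only an assumption about the general idea; the variable names are made up for the example.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Index values from the two frames in the original report.
left = pd.Categorical(["foo", "bar", "bar"])
right = pd.Categorical(["baz", "bar", "plob"])

# Union the category sets first, so the combined values
# can stay categorical instead of being upcast.
unioned = union_categoricals([left, right])
index = pd.CategoricalIndex(unioned, name="one")

print(list(index))
print(sorted(index.categories))
```

The result keeps a categorical dtype whose categories cover both inputs, which is the trade-off compared with the upcast discussed in this issue.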

jbrockmendel · 2020-11-01T18:54:59Z

@jorisvandenbossche i think this is related to the discussion a few months ago about _concat_same_type vs _concat_same_dtype. IIRC you had a nice solution to that. Is there anything from that we can use here?

jorisvandenbossche · 2020-11-03T15:40:38Z

Sorry, I don't recall which discussion you are referring to. Do you have a link?
I removed _concat_same_dtype a while ago (#34198), to use the general concat logic instead (which under the hood uses _concat_same_type in case of EAs).

I think the main issue here is that CategoricalIndex still overrides the base class _concat with different behaviour, giving this error instead of converting to object dtype (what the base class does).
So we need to decide whether we want to change this behaviour. It's not only the pd.concat case shown above, but e.g. also CategoricalIndex.append.
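
For anyone hitting this on an affected version, one possible workaround for the `append` case is to cast both indexes to object explicitly before appending. This is only a sketch of the "convert to object dtype" outcome mentioned above, not a description of what pandas does internally; `left`, `right` and `appended` are illustrative names.

```python
import pandas as pd

left = pd.CategoricalIndex(["foo", "bar", "bar"], name="one")
right = pd.CategoricalIndex(["baz", "bar", "plob"], name="one")

# Drop the categorical dtype up front, so the append never has to
# reconcile two different sets of categories.
appended = left.astype(object).append(right.astype(object))

print(list(appended))
print(appended.dtype)
```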

mroeschke · 2021-06-12T04:41:56Z

This looks to work on master now. Could use a test.

In [47]: print(pd.concat([c, d]))
      two  three
one
foo     1      1
bar     2      3
bar     3      4
baz     2      5
bar     2      3
plob    1      1

In [48]: print(pd.concat([c, d]).index.dtype)
    ...: dtype('O')
object
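
A regression test for this could look roughly like the sketch below. It is not the test that was eventually merged, just an illustration built from the example in the original report, and it assumes a pandas version where the behaviour shown above holds. The test name is hypothetical, and the dtype check is kept loose (only "not categorical") because the exact fallback dtype may vary between versions.

```python
import pandas as pd


def test_concat_categorical_index_non_identical_categories():
    # Hypothetical test name; frames taken from the original report.
    a = pd.DataFrame({"one": ["foo", "bar", "bar"], "two": [1, 2, 3], "three": [1, 3, 4]})
    b = pd.DataFrame({"one": ["baz", "bar", "plob"], "two": [2, 2, 1], "three": [5, 3, 1]})
    a["one"] = a["one"].astype("category")
    b["one"] = b["one"].astype("category")
    c = a.set_index("one")
    d = b.set_index("one")

    # Should not raise, even though the index categories differ.
    result = pd.concat([c, d])

    assert not isinstance(result.index.dtype, pd.CategoricalDtype)
    assert result.index.tolist() == ["foo", "bar", "bar", "baz", "bar", "plob"]
    assert result["two"].tolist() == [1, 2, 3, 2, 2, 1]
    assert result["three"].tolist() == [1, 3, 4, 5, 3, 1]


test_concat_categorical_index_non_identical_categories()
```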

jreback added Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode Difficulty Intermediate labels Sep 22, 2017

jreback added this to the Next Major Release milestone Sep 22, 2017

jbrockmendel removed Effort Medium labels Oct 21, 2019

mroeschke added the Bug label Jun 28, 2020

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jun 12, 2021

P-Tillmann mentioned this issue Jun 28, 2021

TST: Added test for upcasting to object when concatinating on categorical indexes with non-identical categories #42278

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.4 Jul 1, 2021

jreback added the Categorical Categorical Data Type label Jul 1, 2021

jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jul 1, 2021

jreback closed this as completed in #42278 Jul 1, 2021
