Concat on non-identical categories in categorical indexes raises TypeError #17629

Closed
P-Tillmann opened this issue Sep 22, 2017 · 10 comments · Fixed by #42278
Labels
Categorical Categorical Data Type good first issue Needs Tests Unit test(s) needed to prevent regressions Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@P-Tillmann
Contributor

Code Sample, a copy-pastable example if possible

import pandas as pd
a = pd.DataFrame({'one':['foo','bar','bar'],'two':[1,2,3],'three':[1,3,4]})
b = pd.DataFrame({'one':['baz','bar','plob'],'two':[2,2,1],'three':[5,3,1]})

a['one'] = a['one'].astype('category')
b['one'] = b['one'].astype('category')
c = a.set_index('one')
d = b.set_index('one')
print(pd.concat([a, b]))
    one  three  two
0   foo      1    1
1   bar      3    2
2   bar      4    3
0   baz      5    2
1   bar      3    2
2  plob      1    1
print(pd.concat([c, d]))
raise TypeError("categories must match existing categories "
TypeError: categories must match existing categories when appending

Problem description

Concatenating categorical data with non-identical categories is now supported for data columns: the data is upcast to an appropriate dtype. Doing the same thing with categorical indexes still raises an error.
I think it would be convenient for users if data columns and indexes behaved the same way.
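
To make the contrast concrete, here is a minimal sketch of the column behaviour described above, followed by one possible workaround for the index case. The workaround (casting each index to object before concatenating) is an assumption on my part, not something proposed in this thread, and it only makes sense if losing the categorical dtype is acceptable.

```python
import pandas as pd

a = pd.DataFrame({"one": ["foo", "bar", "bar"], "two": [1, 2, 3]})
b = pd.DataFrame({"one": ["baz", "bar", "plob"], "two": [2, 2, 1]})
a["one"] = a["one"].astype("category")
b["one"] = b["one"].astype("category")

# Columns: mismatched categories are upcast instead of being rejected.
cols = pd.concat([a, b])
print(cols["one"].dtype)  # no longer a categorical dtype

# Index workaround (assumption: dropping the categorical dtype is fine).
c = a.set_index("one")
d = b.set_index("one")
c.index = c.index.astype(object)
d.index = d.index.astype(object)

combined = pd.concat([c, d])
print(combined.index.tolist())
```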

Expected Output

print(pd.concat([c, d]))
      three  two
one             
foo       1    1
bar       3    2
bar       4    3
baz       5    2
bar       3    2
plob      1    1
print(pd.concat([c, d]).index.dtype)
dtype('O')

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_GB.UTF-8
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.20.3
pytest: 3.2.1
pip: None
setuptools: None
Cython: 0.26
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Sep 22, 2017

hmm that seems like a bug, it should work the same. can you submit a PR?

@jreback
Contributor

jreback commented Sep 22, 2017

cc @chris-b1 @TomAugspurger

@jreback jreback added Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode Difficulty Intermediate labels Sep 22, 2017
@jreback jreback added this to the Next Major Release milestone Sep 22, 2017
@TomAugspurger
Contributor

Related to #14177 (that's about unioning the columns).

We should do this for both at the same time.

@jreback
Contributor

jreback commented Sep 22, 2017

right but we are now correctly upcasting to object dtype for the columns. This should still work on the indexes now.

#14177 is about providing an option for this (rather than just doing the upcast).

@P-Tillmann
Contributor Author

I'll take a deeper look and see if I can put something together to fix this, and report back if I run into problems.

@postelrich

Was this fixed? Ran into the same error using dd.concat in dask with pandas 0.22.0.

cc @TomAugspurger

@TomAugspurger
Contributor

Still an issue in pandas.

@postelrich was this a recent version of dask? dask/dask#2963 fixed an issue with dd.concat, but dask differs from pandas by unioning the categories before concatenating.
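
For anyone who wants the "union the categories first" behaviour in plain pandas, a rough sketch follows. It is not dask's implementation, only an illustration of the idea: both indexes are given the same category set before concatenating, so the result stays categorical. The variable names are made up for the example.

```python
import pandas as pd

c = pd.DataFrame(
    {"two": [1, 2, 3]},
    index=pd.CategoricalIndex(["foo", "bar", "bar"], name="one"),
)
d = pd.DataFrame(
    {"two": [2, 2, 1]},
    index=pd.CategoricalIndex(["baz", "bar", "plob"], name="one"),
)

# Union of the two category sets (a plain Index of labels).
all_cats = c.index.categories.union(d.index.categories)

# Re-declare both indexes with the shared categories, then concatenate.
c2 = c.copy()
d2 = d.copy()
c2.index = c.index.set_categories(all_cats)
d2.index = d.index.set_categories(all_cats)

result = pd.concat([c2, d2])
print(result.index)
```

Because the dtypes are identical after `set_categories`, the concatenation has nothing to upcast.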

@jbrockmendel
Member

@jorisvandenbossche i think this is related to the discussion a few months ago about _concat_same_type vs _concat_same_dtype. IIRC you had a nice solution to that. Is there anything from that we can use here?

@jorisvandenbossche
Member

Sorry, I don't recall which discussion you are referring to. Do you have a link?
I removed _concat_same_dtype a while ago (#34198), to use the general concat logic instead (which under the hood uses _concat_same_type in case of EAs).

I think the main issue here is that CategoricalIndex still overrides the base class _concat with different behaviour, raising this error instead of converting to object dtype (which is what the base class does).
So we need to decide whether we want to change this behaviour. It's not only the pd.concat case shown above, but also e.g. CategoricalIndex.append.
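
The same situation can be reproduced without any DataFrame by calling `append` directly. The sketch below wraps it in a small helper, `append_with_fallback`, which is a hypothetical name used only for illustration: it tries the plain `append` and, if that raises the TypeError discussed here, retries on object-dtype copies. Which branch runs depends on the installed pandas version.

```python
import pandas as pd

left = pd.CategoricalIndex(["foo", "bar", "bar"], name="one")
right = pd.CategoricalIndex(["baz", "bar", "plob"], name="one")


def append_with_fallback(first, second):
    """Append two indexes, casting to object if categories clash.

    Hypothetical helper for illustration; not part of pandas.
    """
    try:
        return first.append(second)
    except TypeError:
        # Behaviour reported in this issue: mismatched categories raise.
        return first.astype(object).append(second.astype(object))


appended = append_with_fallback(left, right)
print(appended.tolist())
print(appended.dtype)
```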

@mroeschke
Member

This looks to work on master now. Could use a test.

In [47]: print(pd.concat([c, d]))
      two  three
one
foo     1      1
bar     2      3
bar     3      4
baz     2      5
bar     2      3
plob    1      1

In [48]: print(pd.concat([c, d]).index.dtype)
    ...: dtype('O')
object
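
A regression test along these lines might look like the sketch below. The test name is invented, and this is not the test that was eventually merged. It checks the concatenated values and that the index is no longer categorical, but deliberately does not pin the exact resulting dtype, on the assumption that it may differ between pandas versions.

```python
import pandas as pd


def test_concat_categorical_index_non_identical_categories():
    # Hypothetical test name; mirrors the example from the issue.
    a = pd.DataFrame(
        {"one": ["foo", "bar", "bar"], "two": [1, 2, 3], "three": [1, 3, 4]}
    )
    b = pd.DataFrame(
        {"one": ["baz", "bar", "plob"], "two": [2, 2, 1], "three": [5, 3, 1]}
    )
    a["one"] = a["one"].astype("category")
    b["one"] = b["one"].astype("category")
    c = a.set_index("one")
    d = b.set_index("one")

    result = pd.concat([c, d])

    # The index should fall back to a non-categorical dtype.
    assert not isinstance(result.index, pd.CategoricalIndex)
    assert result.index.tolist() == ["foo", "bar", "bar", "baz", "bar", "plob"]
    assert result.index.name == "one"
    assert result["two"].tolist() == [1, 2, 3, 2, 2, 1]
    assert result["three"].tolist() == [1, 3, 4, 5, 3, 1]
```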

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jun 12, 2021
@jreback jreback modified the milestones: Contributions Welcome, 1.4 Jul 1, 2021
@jreback jreback added the Categorical Categorical Data Type label Jul 1, 2021
@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jul 1, 2021
7 participants