Appending Pandas dataframes in for loop results in ValueError #13524
Comments
In [6]: s1 = pd.Series(['a', 'b']).astype('category')
In [7]: s1.append(pd.Series(['b', 'c']).astype('category'))
...
ValueError: incompatible categories in categorical concat
I believe your code would work if you change the |
If you change your example code slightly so there are no NEW categories being added:
In [6]: s1 = pd.Series(['a', 'b']).astype('category')
In [7]: s1.append(pd.Series(['b', 'a']).astype('category'))
then it runs OK. In the original problem, the pd.cut() function generates the same categories in each dataframe, namely 1 to 5, so no new categories are being added. |
Gotcha, here's a simpler example.
In [44]: df = pd.DataFrame(columns=['a'])
In [45]: a = pd.DataFrame({"a": pd.Categorical([1, 2, 3], ordered=True)})
In [46]: a.a
Out[46]:
0 1
1 2
2 3
Name: a, dtype: category
Categories (3, int64): [1 < 2 < 3]
In [47]: df.append(a).a
Out[47]:
0 1
1 2
2 3
Name: a, dtype: category
Categories (3, int64): [1, 2, 3]
So the orderedness of |
Certainly interested – but may not have the skill set. |
👍 just post here if you have any questions. Either way, thanks for the report. Just a hunch, but I would start looking in https://github.com/pydata/pandas/blob/1a9abc44bbfd65675fd99701fe33aad8805ab147/pandas/types/concat.py#L147 |
This is by definition. You need union_categoricals. |
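For readers following along, here is a minimal sketch of what combining two categoricals with different categories can look like. It assumes a recent pandas where the function is importable from `pandas.api.types` (at the time of this thread it lived under `pandas/types/concat.py`), and it reuses the two series from the first example above.

```python
import pandas as pd
from pandas.api.types import union_categoricals  # assumed import path (recent pandas)

# Two categorical series whose category sets differ ('a', 'b' vs. 'b', 'c')
s1 = pd.Series(['a', 'b']).astype('category')
s2 = pd.Series(['b', 'c']).astype('category')

# union_categoricals builds one Categorical whose categories are the union of both
combined = pd.Series(union_categoricals([s1, s2]))

print(list(combined))                 # values from s1 followed by values from s2
print(list(combined.cat.categories))  # union of the categories
```

The result stays categorical instead of raising or falling back to object dtype, which is the behaviour a plain append of mismatched categoricals does not give you.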
@jreback I think my last example should work, no? E.g. |
Yes, I think that is correct. It only seems to happen when you start with an empty frame, or append an empty frame:
|
Hmm, is the empty set of categories ordered or not? 😄 cc @JanSchulz, thoughts? |
Well, if we say that an empty series is ordered=False, then it should actually raise an error instead of changing the order of the result :-) |
The problem is here: https://github.com/pydata/pandas/blob/1a9abc44bbfd65675fd99701fe33aad8805ab147/pandas/types/concat.py#L201 When concat is not dealing with only categoricals, but with a mixture of categoricals and object arrays, it takes the categories from the first categorical to concat, but not the other properties like ordered or not. Should be an easy fix to also pass ordered there. |
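To illustrate the mechanism described in the comment above, here is a small sketch. It does not reproduce the pandas internals; it only shows, under the assumption that the result is rebuilt via the public `pd.Categorical` constructor, how passing the categories without the `ordered` flag silently drops the ordering.

```python
import numpy as np
import pandas as pd

# An ordered categorical, as in the simpler example earlier in the thread
first = pd.Categorical([1, 2, 3], ordered=True)
values = np.asarray(first)

# Rebuilding with only the categories: the ordering flag falls back to unordered
categories_only = pd.Categorical(values, categories=first.categories)

# Rebuilding with categories AND ordered: the ordering is preserved
with_ordered = pd.Categorical(
    values, categories=first.categories, ordered=first.ordered
)

print(categories_only.ordered)  # False
print(with_ordered.ordered)     # True
```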
IMO there can be two interpretations:
pandas.DataFrame({"a": pandas.Categorical([1,2,3])}).append(pandas.DataFrame({"a": pandas.Categorical([1,2])})) # -> Error as it should be...
pandas.DataFrame({"a": pandas.Categorical([])}).append(pandas.DataFrame({"a": pandas.Categorical([1,2])})) # -> Also errors...
The question is whether an empty column is the same as a categorical column without any values. IMO that's the difference between these two dataframes:
and
The first is just the usual "cast to something which can take both", which is the rule for everything but categorical. The second seems to be the upcast rules for int + object? So if the second follows the "normal rules", then IMO appending a categorical should also follow the usual categorical rules, aka erroring. |
I met the same problem in #13626 and wrote short summary of How about following spec:
|
xref #13410, #13524
Author: sinhrks <[email protected]>
Closes #13763 from sinhrks/union_categoricals_ordered and squashes the following commits:
9cadc4e [sinhrks] ENH: union_categorical supports identical categories with ordered
I recently posted this on StackOverflow. It seems to be a bug so I am posting here as well.
I want to generate a dataframe by appending several separate dataframes generated in a for loop. Each individual dataframe consists of a name column, a range of integers and a column identifying a category to which the integer belongs (e.g. quintile 1 to 5). If I generate each dataframe individually and then append one to the other to create a 'master' dataframe then there are no problems. However, when I use a loop to create each individual dataframe, trying to append a dataframe to the master dataframe results in:
ValueError: incompatible categories in categorical concat
A work-around (suggested by jezrael) involved appending each dataframe to a list of dataframes and concatenating them using pd.concat.
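A minimal sketch of that work-around follows. This is not the original code sample from the report; the names (`"x"`, `"y"`, `"z"`), the value range and the use of `pd.cut` with five labelled bins are assumptions chosen to match the description above.

```python
import pandas as pd

frames = []  # collect the per-iteration dataframes here instead of appending to a master frame

for name in ["x", "y", "z"]:  # hypothetical names
    df = pd.DataFrame({"name": name, "value": list(range(1, 11))})
    # Assign each integer to a quintile; every frame gets the same categories 1 to 5
    df["quintile"] = pd.cut(df["value"], bins=5, labels=[1, 2, 3, 4, 5])
    frames.append(df)

# Concatenate once at the end
master = pd.concat(frames, ignore_index=True)

print(master["quintile"].dtype)
print(len(master))
```

Because no empty master dataframe is involved, every piece being concatenated already carries the same categorical dtype, and a single concatenation is also cheaper than growing a dataframe inside the loop.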
I've written a simplified loop to illustrate:
Code Sample, a copy-pastable example if possible
Expected Output
output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.4.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
pandas: 0.18.1
nose: None
pip: 1.5.6
setuptools: 20.1.1
Cython: None
numpy: 1.11.0
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: 4.1.1
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.0
openpyxl: 2.3.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: 0.7.4.None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None