-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: concat and append now can handle unordered categories #13767
ENH: concat and append now can handle unordered categories #13767
Conversation
Current coverage is 85.25% (diff: 100%)@@ master #13767 diff @@
==========================================
Files 139 139
Lines 50496 50491 -5
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
- Hits 43049 43044 -5
Misses 7447 7447
Partials 0 0
|
@@ -3852,16 +3852,15 @@ def test_concat(self): | |||
res = pd.concat([df, df]) | |||
tm.assert_frame_equal(exp, res) | |||
|
|||
# Concat should raise if the two categoricals do not have the same | |||
# categories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the old test case should be modified that it still again tests the case that "more unequal" cats fail the concat.
What is the actual "problem" or use case here? I'm opposed to this PR: IMO So, I find the master variant more in line with that thinking. What I'm not sure is whether it should raise or (as it does now) convert to object. Other cat handling of two different cats (apart from the special I can see a logic for "categorical + iterable of the same values as in the categories = new categorical with the same categories as the first one" as that can be thought similar as |
@JanSchulz thx for sharing your opinion. Can you illustrate your intention covering all possible patterns? (I copied mine from #13524).
|
I come from using catagoricals as things like lickert scales: "Fully agree ... full disagree" and such things. From that standpoint, I usually read data in, convert to cat and replace the categories with the full set. In that case the categories have a specific meaning (e.g the lickert scale or in other cases the 50 US states or such things which are fixed). From that comes the rule that you shouldn't be able to set a value in the cat column to something not in categories (adding "Germany" to a cat encoding the 50 US states makes no sense). And from that the "error if combining two different categories" rule. So I'm fully happy that there isn't any "default" (i.e. anything in the I see The second usecase is as an internal helper for someone using it with reading in csv files before letting the user do some "meaningful" replacements. |
Thanks for sharing the usecase. Based on this, what the "rules" on your mind are? can u list up possible rules like: |
IMO these are the rules:
|
I agree with @JanSchulz rules, except where The idea is that I suppose that we could further add an argument to |
If you also want to change setting as well, then this is a whole different thing... IMO the above rules are what currently applies. I'm still not convinced: So what happens if I have a Categorical which is encoded as What is the actual usecase here: internal csv parsing of different files to a single dataframe? If so, then in my eyes this looks (again :-)) like the "pd-String" usecase: you encode it directly to save memory and then you are have the problem to concat two dfs because you don't know the strings (=you have no beforehand knowledge about the expected categories). If so, then there are multiple different ways to solve this:
|
I agree that Based on this, I've organized the rules as below table.
Also, adding following options to
|
693730c
to
9402cb1
Compare
Categorical Concatenation | ||
^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
- A function :func:`union_categorical` has been added for combining categoricals, see :ref:`Unioning Categoricals<categorical.union>` (:issue:`13361`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
categoricals
9402cb1
to
113418d
Compare
once updated based on the discussion. Note that following fixed are not included:
|
b7c3b12
to
940a137
Compare
@@ -1258,9 +1258,12 @@ def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, | |||
join_axes : list of Index objects | |||
Specific indexes to use for the other n - 1 axes instead of performing | |||
inner/outer set logic | |||
verify_integrity : boolean, default False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a doc-string in merge.rst
, can you make sure to update that as well.
a04ff74
to
561735a
Compare
ok, let's take out |
@jreback that sounds good to me. Easier to add it later, and having the extra function is a stopgap for users who need it |
561735a
to
86cc682
Compare
|
||
.. _categorical.concat: | ||
|
||
Concatenation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a link with a note in merge.rst
to here
@sinhrks can you update |
@sinhrks can you update? |
86cc682
to
7f1e293
Compare
7f1e293
to
589d88d
Compare
@sinhrks I think the decision was to take out the kwarg for now? (see #13767 (comment) above). |
11d47d2
to
96a372e
Compare
@jorisvandenbossche You're right. Updated to remove |
|
||
# all category nan-likes => category | ||
s1 = pd.Series([np.nan, np.nan], dtype='category') | ||
s2 = pd.Series([np.nan, np.nan], dtype='category') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u add simikar for Datetimes and all NaT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, will do in today.
…eter * github.com:pydata/pandas: (554 commits) BUG: compat with Stata ver 111 Fix: F999 dictionary key '2000q4' repeated with different values (pandas-dev#14198) BLD: Test for Python 3.5 with C locale BUG: DatetimeTZBlock can't assign values near dst boundary BUG: union_categorical with Series and cat idx BUG: fix str.contains for series containing only nan values BUG: Categorical constructor not idempotent with ext dtype TST: Make encoded sep check more locale sensitive (pandas-dev#14161) DOC: minor typo in 0.19.0 whatsnew file (pandas-dev#14185) BUG: fix tz-aware datetime convert to DatetimeIndex (GH 14088) BUG : bug in setting a slice of a Series with a np.timedelta64 RLS: v0.19.0rc1 DOC: clean-up 0.19.0 whatsnew file (pandas-dev#14176) DOC: cleanup build warnings (pandas-dev#14172) Add steps to run gbq integration testing to the contributing docs (pandas-dev#14144) ENH: concat and append now can handle unordered categories (pandas-dev#13767) DEPR: Deprecate pandas.core.datetools (pandas-dev#14105) API/DEPR: Remove +/- as setops for DatetimeIndex/PeriodIndex (GH9630) (pandas-dev#14164) Fix trivial typo in comment (pandas-dev#14174) API/DEPR: Remove +/- as setops for Index (GH8227) (pandas-dev#14127) ...
* github.com:pydata/pandas: (554 commits) BUG: compat with Stata ver 111 Fix: F999 dictionary key '2000q4' repeated with different values (pandas-dev#14198) BLD: Test for Python 3.5 with C locale BUG: DatetimeTZBlock can't assign values near dst boundary BUG: union_categorical with Series and cat idx BUG: fix str.contains for series containing only nan values BUG: Categorical constructor not idempotent with ext dtype TST: Make encoded sep check more locale sensitive (pandas-dev#14161) DOC: minor typo in 0.19.0 whatsnew file (pandas-dev#14185) BUG: fix tz-aware datetime convert to DatetimeIndex (GH 14088) BUG : bug in setting a slice of a Series with a np.timedelta64 RLS: v0.19.0rc1 DOC: clean-up 0.19.0 whatsnew file (pandas-dev#14176) DOC: cleanup build warnings (pandas-dev#14172) Add steps to run gbq integration testing to the contributing docs (pandas-dev#14144) ENH: concat and append now can handle unordered categories (pandas-dev#13767) DEPR: Deprecate pandas.core.datetools (pandas-dev#14105) API/DEPR: Remove +/- as setops for DatetimeIndex/PeriodIndex (GH9630) (pandas-dev#14164) Fix trivial typo in comment (pandas-dev#14174) API/DEPR: Remove +/- as setops for Index (GH8227) (pandas-dev#14127) ...
git diff upstream/master | flake8 --diff
on current master:
this PR (updated according to the discussion):
CC: @JanSchulz, @chris-b1