API: union_categoricals in concat #14177

jorisvandenbossche · 2016-09-07T13:20:54Z

Follow up of #13767 (this part was left out of the final merged version). Summary copied from there:

Discussion: addition of new keyword to concat to union categoricals with different categories

PR #13767 proposed to add a union_categoricals keyword to concat, which lets you concatenate categoricals with different categories (like the union_categoricals function does). Default would be False.

There was some discussion about the name to use. @wesm suggested to use a dict to pass type specific options. But that is maybe something we should first explore further to see if there are other use cases for such a construct?

The text was updated successfully, but these errors were encountered:

Dr-Irv · 2018-08-30T15:00:39Z

So I looked at the discussion in #13767 where @sinhrks had implemented the union_categoricals=True option to pd.concat(), and we need this option in our project.

Here's the use case. I have all the data sitting in a MySQL database. There are columns that would be better handled as categorical columns. We have a complex SQL join creating a simple 5 column table of 33 million rows. 3 of the columns should be categorical. Reading the table from CSV, pandas takes 1.5GB, before converting the 3 columns to categories. Reading the table using standard MySQL connectors and SQL alchemy takes 14.5GB. There is an option to SQL Alchemy that causes the server to chunk the data, and that reduces the memory footprint to 6.5GB. If I could then use the chunksize argument on pd.read_sql(), where I convert the 3 columns of each chunk to category dtype, and pass the resulting generator from read_sql() to pd.concat() with union_categoricals=True, then I could let the categories be dynamically defined as the data is read in from the SQL database.

I'm willing to look at what @sinhrks did and make a PR so we can get the memory usage down, or maybe we can just have him put that functionality in a PR to support this use case.

ivirshup · 2019-11-21T05:41:08Z

I have related use cases for this, or a broader "the return type should still be categorical". First I'll layout the use case, then give the specific pandas functions where this could be used.

The analysis problem: During data analysis (of single cell RNA-seq data), I generate multiple unsupervised clusterings for my dataset. I use these as a starting point for figuring out what categories of samples exist in my dataset. I'll want to manually curate these labelling based on downstream analysis and feedback from domain experts.

The pandas functions: pd.Series.map is useful here. If I want to merge two clusters I can do:

new_clusters = clusters.map(lambda x: {"cluster_a": "cluster_b"}.get(x, x))

But now I have an object series, since that's the behavior if the number of categories changes. I'd like to have a keyword argument that specified I still want categorical outputs.

Another case is pd.Series.where. I may want to replace labels for some observations with clusters from another.

s = pd.Series(list("aabcdd"), dtype="category") 
t = pd.Series(list("effggh"), dtype="category") 
s.where(lambda x: np.isin(x, ("a", "b")), t)

This currently returns an object series and throws a FutureWarning that it will be a value error unless I explicitly construct a new set of categories and cast to those. I'd rather that I could just specify that I want a categorical (either union or using the resulting set of categories) in the call. Basically, this feels verbose for something I do interactively:

# From this:
s.where(lambda x: np.isin(x, ("a", "b")), t)
# To:
new_cats = s.cat.categories.union(t.cat.categories)
s.cat.set_categories(new_cats).where(lambda x: x.isin(x, ("a", "b")), t.cat.set_categories(new_cats))

Edit: Oof. This comment was so bad it fails CI 😳: https://github.com/pandas-dev/pandas/runs/313172123
Raw logs

jbrockmendel · 2020-10-31T00:44:34Z

@jorisvandenbossche can the idea here be updated in a way consistent with the "dont special-case pandas-internal EAs" goal?

jorisvandenbossche · 2020-11-03T19:18:10Z

This is categorical-specific, AFAIK

ivirshup · 2020-11-09T09:28:22Z

Wouldn't the idea of adding a dict valued argument for type specific options make this more general (as mentioned at the top)? E.g.

pd.concat(
    {...},
    dtype_kwargs={pd.Categorical: {"union": True}}
)

jbrockmendel · 2023-02-22T23:25:55Z

Discussed this on the dev call today, agreed to add a keyword to pd.concat to opt in to this behavior.

jorisvandenbossche added API Design Categorical Categorical Data Type Needs Discussion Requires discussion from core team before further action labels Sep 7, 2016

TomAugspurger mentioned this issue Sep 12, 2017

Categorical data not surviving a repartition dask/dask#2674

Closed

TomAugspurger mentioned this issue Sep 22, 2017

Concat on non-identical categories in categorical indexes raises TypeError #17629

Closed

Dr-Irv mentioned this issue Oct 22, 2018

Missing Values and Categoricals - inconsistent dtypes #23242

Open

ivirshup mentioned this issue Nov 21, 2019

CI: Action to assign issues fails when quotes are present #29768

Closed

Dr-Irv mentioned this issue Jan 31, 2020

ENH: pd.cut should be able to return a Series of IntervalDtype #31459

Open

mroeschke added the Enhancement label Jun 28, 2020

mroeschke added Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed API Design labels May 1, 2021

jbrockmendel mentioned this issue Feb 13, 2023

BUG: pd.concat doesn't preserve categorical dtypes #51362

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: union_categoricals in concat #14177

API: union_categoricals in concat #14177

jorisvandenbossche commented Sep 7, 2016

Dr-Irv commented Aug 30, 2018

ivirshup commented Nov 21, 2019 •

edited

Loading

jbrockmendel commented Oct 31, 2020 •

edited

Loading

jorisvandenbossche commented Nov 3, 2020

ivirshup commented Nov 9, 2020 •

edited

Loading

jbrockmendel commented Feb 22, 2023

API: union_categoricals in concat #14177

API: union_categoricals in concat #14177

Comments

jorisvandenbossche commented Sep 7, 2016

Dr-Irv commented Aug 30, 2018

ivirshup commented Nov 21, 2019 • edited Loading

jbrockmendel commented Oct 31, 2020 • edited Loading

jorisvandenbossche commented Nov 3, 2020

ivirshup commented Nov 9, 2020 • edited Loading

jbrockmendel commented Feb 22, 2023

ivirshup commented Nov 21, 2019 •

edited

Loading

jbrockmendel commented Oct 31, 2020 •

edited

Loading

ivirshup commented Nov 9, 2020 •

edited

Loading