Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: union_categoricals in concat #14177

Open
jorisvandenbossche opened this issue Sep 7, 2016 · 6 comments
Open

API: union_categoricals in concat #14177

jorisvandenbossche opened this issue Sep 7, 2016 · 6 comments
Labels
Categorical Categorical Data Type Enhancement Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@jorisvandenbossche
Copy link
Member

Follow up of #13767 (this part was left out of the final merged version). Summary copied from there:

Discussion: addition of new keyword to concat to union categoricals with different categories

PR #13767 proposed to add a union_categoricals keyword to concat, which lets you concatenate categoricals with different categories (like the union_categoricals function does). Default would be False.

There was some discussion about the name to use. @wesm suggested to use a dict to pass type specific options. But that is maybe something we should first explore further to see if there are other use cases for such a construct?

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Aug 30, 2018

So I looked at the discussion in #13767 where @sinhrks had implemented the union_categoricals=True option to pd.concat(), and we need this option in our project.

Here's the use case. I have all the data sitting in a MySQL database. There are columns that would be better handled as categorical columns. We have a complex SQL join creating a simple 5 column table of 33 million rows. 3 of the columns should be categorical. Reading the table from CSV, pandas takes 1.5GB, before converting the 3 columns to categories. Reading the table using standard MySQL connectors and SQL alchemy takes 14.5GB. There is an option to SQL Alchemy that causes the server to chunk the data, and that reduces the memory footprint to 6.5GB. If I could then use the chunksize argument on pd.read_sql(), where I convert the 3 columns of each chunk to category dtype, and pass the resulting generator from read_sql() to pd.concat() with union_categoricals=True, then I could let the categories be dynamically defined as the data is read in from the SQL database.

I'm willing to look at what @sinhrks did and make a PR so we can get the memory usage down, or maybe we can just have him put that functionality in a PR to support this use case.

@ivirshup
Copy link
Contributor

ivirshup commented Nov 21, 2019

I have related use cases for this, or a broader "the return type should still be categorical". First I'll layout the use case, then give the specific pandas functions where this could be used.

The analysis problem: During data analysis (of single cell RNA-seq data), I generate multiple unsupervised clusterings for my dataset. I use these as a starting point for figuring out what categories of samples exist in my dataset. I'll want to manually curate these labelling based on downstream analysis and feedback from domain experts.

The pandas functions: pd.Series.map is useful here. If I want to merge two clusters I can do:

new_clusters = clusters.map(lambda x: {"cluster_a": "cluster_b"}.get(x, x))

But now I have an object series, since that's the behavior if the number of categories changes. I'd like to have a keyword argument that specified I still want categorical outputs.

Another case is pd.Series.where. I may want to replace labels for some observations with clusters from another.

s = pd.Series(list("aabcdd"), dtype="category") 
t = pd.Series(list("effggh"), dtype="category") 
s.where(lambda x: np.isin(x, ("a", "b")), t)

This currently returns an object series and throws a FutureWarning that it will be a value error unless I explicitly construct a new set of categories and cast to those. I'd rather that I could just specify that I want a categorical (either union or using the resulting set of categories) in the call. Basically, this feels verbose for something I do interactively:

# From this:
s.where(lambda x: np.isin(x, ("a", "b")), t)
# To:
new_cats = s.cat.categories.union(t.cat.categories)
s.cat.set_categories(new_cats).where(lambda x: x.isin(x, ("a", "b")), t.cat.set_categories(new_cats))

Edit: Oof. This comment was so bad it fails CI 😳: https://github.com/pandas-dev/pandas/runs/313172123
Raw logs

@jbrockmendel
Copy link
Member

jbrockmendel commented Oct 31, 2020

@jorisvandenbossche can the idea here be updated in a way consistent with the "dont special-case pandas-internal EAs" goal?

@jorisvandenbossche
Copy link
Member Author

This is categorical-specific, AFAIK

@ivirshup
Copy link
Contributor

ivirshup commented Nov 9, 2020

Wouldn't the idea of adding a dict valued argument for type specific options make this more general (as mentioned at the top)? E.g.

pd.concat(
    {...},
    dtype_kwargs={pd.Categorical: {"union": True}}
)

@mroeschke mroeschke added Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed API Design labels May 1, 2021
@jbrockmendel
Copy link
Member

Discussed this on the dev call today, agreed to add a keyword to pd.concat to opt in to this behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Enhancement Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

5 participants