-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: union_categoricals in concat #14177
Comments
So I looked at the discussion in #13767 where @sinhrks had implemented the Here's the use case. I have all the data sitting in a MySQL database. There are columns that would be better handled as categorical columns. We have a complex SQL join creating a simple 5 column table of 33 million rows. 3 of the columns should be categorical. Reading the table from CSV, pandas takes 1.5GB, before converting the 3 columns to categories. Reading the table using standard MySQL connectors and SQL alchemy takes 14.5GB. There is an option to SQL Alchemy that causes the server to chunk the data, and that reduces the memory footprint to 6.5GB. If I could then use the I'm willing to look at what @sinhrks did and make a PR so we can get the memory usage down, or maybe we can just have him put that functionality in a PR to support this use case. |
I have related use cases for this, or a broader "the return type should still be categorical". First I'll layout the use case, then give the specific pandas functions where this could be used. The analysis problem: During data analysis (of single cell RNA-seq data), I generate multiple unsupervised clusterings for my dataset. I use these as a starting point for figuring out what categories of samples exist in my dataset. I'll want to manually curate these labelling based on downstream analysis and feedback from domain experts. The pandas functions: new_clusters = clusters.map(lambda x: {"cluster_a": "cluster_b"}.get(x, x)) But now I have an object series, since that's the behavior if the number of categories changes. I'd like to have a keyword argument that specified I still want categorical outputs. Another case is s = pd.Series(list("aabcdd"), dtype="category")
t = pd.Series(list("effggh"), dtype="category")
s.where(lambda x: np.isin(x, ("a", "b")), t) This currently returns an object series and throws a # From this:
s.where(lambda x: np.isin(x, ("a", "b")), t)
# To:
new_cats = s.cat.categories.union(t.cat.categories)
s.cat.set_categories(new_cats).where(lambda x: x.isin(x, ("a", "b")), t.cat.set_categories(new_cats)) Edit: Oof. This comment was so bad it fails CI 😳: https://github.com/pandas-dev/pandas/runs/313172123 |
@jorisvandenbossche can the idea here be updated in a way consistent with the "dont special-case pandas-internal EAs" goal? |
This is categorical-specific, AFAIK |
Wouldn't the idea of adding a dict valued argument for type specific options make this more general (as mentioned at the top)? E.g. pd.concat(
{...},
dtype_kwargs={pd.Categorical: {"union": True}}
) |
Discussed this on the dev call today, agreed to add a keyword to pd.concat to opt in to this behavior. |
Follow up of #13767 (this part was left out of the final merged version). Summary copied from there:
Discussion: addition of new keyword to
concat
to union categoricals with different categoriesPR #13767 proposed to add a
union_categoricals
keyword toconcat
, which lets you concatenate categoricals with different categories (like theunion_categoricals
function does). Default would be False.There was some discussion about the name to use. @wesm suggested to use a dict to pass type specific options. But that is maybe something we should first explore further to see if there are other use cases for such a construct?
The text was updated successfully, but these errors were encountered: