Behavior change when writing after setitem operations with pandas 2.0 vs pandas 1.5.3 #40

Closed
DriesSchaumont opened this issue Apr 7, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@DriesSchaumont

Describe the bug
With pandas 2.0.0, the concat behavior has changed when concatenating a boolean dtype and a numeric dtype. The resulting dtype used to be numeric, which mudata can write. It is now object, which results in TypeError: Can't implicitly convert non-string objects to strings. The behavior of bool + nan also differs from that of str + nan, which causes no problems.

Warning in pandas 1.5.3:

FutureWarning: Behavior when concatenating bool-dtype and numeric-dtype arrays is deprecated; in a future version these will cast to object dtype (instead of coercing bools to numeric values). To retain the old behavior, explicitly cast bool-dtype arrays to numeric dtype.
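
The explicit cast that the warning recommends can be sketched with plain pandas Series. This is only an illustration, not mudata code; the Series names and index labels are made up for the example.

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins for the 'test' column of two modalities.
bool_col = pd.Series([True, True], index=["q", "w"])
nan_col = pd.Series([np.nan, np.nan], index=["x", "y"])

# Cast the bool column to a numeric dtype *before* concatenating,
# so both inputs are float64 and no bool/numeric mixing takes place.
combined = pd.concat([bool_col.astype("float64"), nan_col])
print(combined.dtype)  # float64
```

The cost of this workaround is that the boolean values are stored as 1.0 and 0.0, so the information that the column was boolean is lost.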

To Reproduce

import pandas as pd
import mudata
import anndata
import numpy as np
from itertools import product
import warnings

dtype_matrix = {"na": np.nan, "string": "str", "bool": True, "float": 1.0}

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    for first_col, second_col in product(dtype_matrix.items(), repeat=2):
        first_col_type, first_col_val = first_col
        second_col_type, second_col_val = second_col
        m = mudata.MuData({
            "mod1": anndata.AnnData(pd.DataFrame([[1,2], [3,4]]), obs=pd.DataFrame(index=list("AB")), var=pd.DataFrame([["a", "b"], ["c", "d"]], index=["q", "w"], columns=["var1", "overlap"]), dtype=np.float64),
            "mod2": anndata.AnnData(pd.DataFrame([[5,6], [7,8]]), obs=pd.DataFrame(index=list("CD")), var=pd.DataFrame([["e", "f"], ["g", "h"]], index=["x", "y"], columns=["var2", "overlap"]), dtype=np.float64),
        })
        m.mod['mod1'].var['test'] = first_col_val        
        m.mod['mod2'].var['test'] = second_col_val
        m.update()
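        # m.write raises TypeError when a column cannot be serialized
        # (e.g. an object column holding non-string values).
        could_write = True
        try:
            m.write("test.h5mu")
        except TypeError:
            could_write = False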
        
        print(f"Concat {first_col_type} ({first_col_val}, {m.mod['mod1'].var['test'].dtype}) and {second_col_type} ({second_col_val}, {m.mod['mod2'].var['test'].dtype}) results in: {m.var['test'].dtype}, able to write: {could_write}")

print(f"Pandas: {pd.__version__}")
print(f"anndata: {anndata.__version__}")
print(f"mudata: {mudata.__version__}")

With pandas 2.0.0:

Concat na (nan, float64) and na (nan, float64) results in: float64, able to write: True
Concat na (nan, float64) and string (str, category) results in: object, able to write: True
Concat na (nan, float64) and bool (True, bool) results in: object, able to write: False <--
Concat na (nan, float64) and float (1.0, float64) results in: float64, able to write: True
Concat string (str, category) and na (nan, float64) results in: object, able to write: True
Concat string (str, category) and string (str, category) results in: object, able to write: True
Concat string (str, object) and bool (True, bool) results in: object, able to write: False
Concat string (str, object) and float (1.0, float64) results in: object, able to write: False
Concat bool (True, bool) and na (nan, float64) results in: object, able to write: False <--
Concat bool (True, bool) and string (str, object) results in: object, able to write: False
Concat bool (True, bool) and bool (True, bool) results in: bool, able to write: True
Concat bool (True, bool) and float (1.0, float64) results in: object, able to write: False
Concat float (1.0, float64) and na (nan, float64) results in: float64, able to write: True
Concat float (1.0, float64) and string (str, object) results in: object, able to write: False
Concat float (1.0, float64) and bool (True, bool) results in: float64, able to write: True
Concat float (1.0, float64) and float (1.0, float64) results in: float64, able to write: True
Pandas: 2.0.0
anndata: 0.8.0
mudata: 0.2.2

With pandas 1.5.3:

Concat na (nan, float64) and na (nan, float64) results in: float64, able to write: True
Concat na (nan, float64) and string (str, category) results in: object, able to write: True
Concat na (nan, float64) and bool (True, bool) results in: float64, able to write: True <--
Concat na (nan, float64) and float (1.0, float64) results in: float64, able to write: True
Concat string (str, category) and na (nan, float64) results in: object, able to write: True
Concat string (str, category) and string (str, category) results in: object, able to write: True
Concat string (str, object) and bool (True, bool) results in: object, able to write: False
Concat string (str, object) and float (1.0, float64) results in: object, able to write: False
Concat bool (True, bool) and na (nan, float64) results in: float64, able to write: True <--
Concat bool (True, bool) and string (str, object) results in: object, able to write: False
Concat bool (True, bool) and bool (True, bool) results in: bool, able to write: True
Concat bool (True, bool) and float (1.0, float64) results in: object, able to write: False
Concat float (1.0, float64) and na (nan, float64) results in: float64, able to write: True
Concat float (1.0, float64) and string (str, object) results in: object, able to write: False
Concat float (1.0, float64) and bool (True, bool) results in: float64, able to write: True
Concat float (1.0, float64) and float (1.0, float64) results in: float64, able to write: True
Pandas: 1.5.3
anndata: 0.8.0
mudata: 0.2.2

I think this can be tracked down to this concat:

data_common = pd.concat(
    [getattr(a, attr)[columns_common] for m, a in self.mod.items()],
    join="outer",
    axis=0,
    sort=False,
)
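
To check the dtype change without mudata or anndata, the same kind of concat can be run on two small DataFrames. This is a sketch under the assumption that the per-modality `var` tables reduce to the frames below; `var1`, `var2` and their contents are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the .var tables of two modalities.
var1 = pd.DataFrame({"test": [True, True]}, index=["q", "w"])
var2 = pd.DataFrame({"test": [np.nan, np.nan]}, index=["x", "y"])
columns_common = ["test"]

# Same arguments as the concat quoted above, applied to plain DataFrames.
data_common = pd.concat(
    [df[columns_common] for df in (var1, var2)],
    join="outer",
    axis=0,
    sort=False,
)

# The resulting dtype depends on the installed pandas version.
print(pd.__version__, data_common["test"].dtype)
```

Going by the outputs reported above, this should print float64 under pandas 1.5.3 (with a FutureWarning) and object under pandas 2.0.0.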

Expected behaviour
I would not expect a change in behavior.

System

  • OS: macOS Ventura
  • Python version: 3.10.10
  • Versions of libraries involved: see examples above

Additional context
Could be related to scverse/anndata#679 but the issue being reported here is a behavior change so I would flag this as a separate bug (either way the discrepancy between str + nan and bool + nan should be resolved).

DriesSchaumont added the bug label Apr 7, 2023
gtca added a commit that referenced this issue May 25, 2023
Use "boolean" dtype instead of bool to deal with nullable bool arrays
@gtca
Collaborator

gtca commented May 25, 2023

Hey @DriesSchaumont,

Thanks for noticing this change of behaviour with pandas 2.0 and providing a great example to test it.

I've started addressing it in #43, beginning with the boolean + nan combination that you highlighted.
So far I'm taking advantage of nullable boolean arrays.
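
As a rough illustration of the idea (not the actual code in #43), pandas' nullable "boolean" extension dtype can hold True, False and a missing value in one column, so the concatenated result does not have to fall back to object. The Series below are invented for the example.

```python
import pandas as pd

# Convert the plain bool column to the nullable "boolean" extension dtype.
bool_col = pd.Series([True, True], index=["q", "w"]).astype("boolean")

# Represent the missing side as a nullable boolean column of pd.NA values.
na_col = pd.Series([pd.NA, pd.NA], index=["x", "y"], dtype="boolean")

# Both inputs share the same extension dtype, so concat preserves it.
combined = pd.concat([bool_col, na_col])
print(combined.dtype)  # boolean
```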

In case you have any thoughts on what behaviour you would find most intuitive and/or how we can potentially generalise this decision making beyond just bool -> boolean conversion for nullable boolean arrays, I'd be interested to discuss it!

@gtca
Collaborator

gtca commented Sep 12, 2023

By the way, already with pandas 1.5.2 and mudata 0.2.3, float + bool is coerced to object dtype (same as bool + float).

And a short update is that mudata 0.3.0 will try to be more careful with using nullable boolean arrays to avoid potential issues like scverse/muon#111 (e.g. by using bool when there is no NA in the column in the end).
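
One way to picture "using bool when there is no NA in the column in the end" is a small post-processing step. The helper below is hypothetical and is not taken from mudata; it only sketches the decision rule.

```python
import pandas as pd


def downcast_boolean(col: pd.Series) -> pd.Series:
    """Hypothetical helper: fall back to plain bool when no value is missing."""
    if isinstance(col.dtype, pd.BooleanDtype) and not col.isna().any():
        return col.astype(bool)
    # Keep the nullable dtype if any NA is present (or leave other dtypes alone).
    return col


complete = pd.Series([True, False], dtype="boolean")
with_na = pd.Series([True, pd.NA], dtype="boolean")

print(downcast_boolean(complete).dtype)  # bool
print(downcast_boolean(with_na).dtype)   # boolean
```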

@gtca gtca closed this as completed Sep 12, 2023