Behavior change when writing after setitem operations with pandas 2.0 vs pandas 1.5.3 #40

DriesSchaumont · 2023-04-07T08:51:07Z

Describe the bug
With pandas 2.0.0, the concat behavior has changed when concatenating a boolean and a numeric dtype. The resulting dtype used to be a numeric dtype, which can be written by mudata. However, this has been changed to object, which results in TypeError: Can't implicitly convert non-string objects to strings. The behavior of bool + nan is also different from the behaviour of str + nan, the latter causing no problems.

Warning in pandas 1.5.3:

FutureWarning: Behavior when concatenating bool-dtype and numeric-dtype arrays is deprecated; in a future version these will cast to object dtype (instead of coercing bools to numeric values). To retain the old behavior, explicitly cast bool-dtype arrays to numeric dtype.
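The dtype change can also be observed without mudata. The following is a minimal pandas-only sketch; the series names and index labels are made up for illustration. Going by the outputs reported in this issue, the first dtype should be float64 on pandas 1.5.3 and object on pandas 2.0.0, while the explicit cast suggested by the warning gives float64 on both.

```python
import warnings

import numpy as np
import pandas as pd

# Illustrative stand-ins: a bool column from one modality, an all-NaN column from another
bools = pd.Series([True, True], index=["q", "w"])
nans = pd.Series([np.nan, np.nan], index=["x", "y"])

with warnings.catch_warnings():
    # pandas 1.5.x emits the FutureWarning quoted above; silence it here
    warnings.simplefilter("ignore", FutureWarning)
    combined = pd.concat([bools, nans])

# Explicit cast, as the warning text suggests, to keep a numeric result
combined_cast = pd.concat([bools.astype("float64"), nans])

print(f"pandas {pd.__version__}")
print(f"bool + nan:            {combined.dtype}")
print(f"float64(bool) + nan:   {combined_cast.dtype}")
```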

To Reproduce

import pandas as pd
import mudata
import anndata
import numpy as np
from itertools import product
import warnings

dtype_matrix = {"na": np.nan, "string": "str", "bool": True, "float": 1.0}

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    for first_col, second_col in product(dtype_matrix.items(), repeat=2):
        first_col_type, first_col_val = first_col
        second_col_type, second_col_val = second_col
        m = mudata.MuData({
            "mod1": anndata.AnnData(pd.DataFrame([[1,2], [3,4]]), obs=pd.DataFrame(index=list("AB")), var=pd.DataFrame([["a", "b"], ["c", "d"]], index=["q", "w"], columns=["var1", "overlap"]), dtype=np.float64),
            "mod2": anndata.AnnData(pd.DataFrame([[5,6], [7,8]]), obs=pd.DataFrame(index=list("CD")), var=pd.DataFrame([["e", "f"], ["g", "h"]], index=["x", "y"], columns=["var2", "overlap"]), dtype=np.float64),
        })
        m.mod['mod1'].var['test'] = first_col_val
        m.mod['mod2'].var['test'] = second_col_val
        m.update()
        could_write = True
        try:
            m.write("test.h5mu")
        except TypeError:
            could_write = False

        print(f"Concat {first_col_type} ({first_col_val}, {m.mod['mod1'].var['test'].dtype}) and {second_col_type} ({second_col_val}, {m.mod['mod2'].var['test'].dtype}) results in: {m.var['test'].dtype}, able to write: {could_write}")

print(f"Pandas: {pd.__version__}")
print(f"anndata: {anndata.__version__}")
print(f"mudata: {mudata.__version__}")

With pandas 2.0.0:

Concat na (nan, float64) and na (nan, float64) results in: float64, able to write: True
Concat na (nan, float64) and string (str, category) results in: object, able to write: True
Concat na (nan, float64) and bool (True, bool) results in: object, able to write: False <--
Concat na (nan, float64) and float (1.0, float64) results in: float64, able to write: True
Concat string (str, category) and na (nan, float64) results in: object, able to write: True
Concat string (str, category) and string (str, category) results in: object, able to write: True
Concat string (str, object) and bool (True, bool) results in: object, able to write: False
Concat string (str, object) and float (1.0, float64) results in: object, able to write: False
Concat bool (True, bool) and na (nan, float64) results in: object, able to write: False <--
Concat bool (True, bool) and string (str, object) results in: object, able to write: False
Concat bool (True, bool) and bool (True, bool) results in: bool, able to write: True
Concat bool (True, bool) and float (1.0, float64) results in: object, able to write: False
Concat float (1.0, float64) and na (nan, float64) results in: float64, able to write: True
Concat float (1.0, float64) and string (str, object) results in: object, able to write: False
Concat float (1.0, float64) and bool (True, bool) results in: float64, able to write: True
Concat float (1.0, float64) and float (1.0, float64) results in: float64, able to write: True
Pandas: 2.0.0
anndata: 0.8.0
mudata: 0.2.2

With pandas 1.5.3:

Concat na (nan, float64) and na (nan, float64) results in: float64, able to write: True
Concat na (nan, float64) and string (str, category) results in: object, able to write: True
Concat na (nan, float64) and bool (True, bool) results in: float64, able to write: True <--
Concat na (nan, float64) and float (1.0, float64) results in: float64, able to write: True
Concat string (str, category) and na (nan, float64) results in: object, able to write: True
Concat string (str, category) and string (str, category) results in: object, able to write: True
Concat string (str, object) and bool (True, bool) results in: object, able to write: False
Concat string (str, object) and float (1.0, float64) results in: object, able to write: False
Concat bool (True, bool) and na (nan, float64) results in: float64, able to write: True <--
Concat bool (True, bool) and string (str, object) results in: object, able to write: False
Concat bool (True, bool) and bool (True, bool) results in: bool, able to write: True
Concat bool (True, bool) and float (1.0, float64) results in: object, able to write: False
Concat float (1.0, float64) and na (nan, float64) results in: float64, able to write: True
Concat float (1.0, float64) and string (str, object) results in: object, able to write: False
Concat float (1.0, float64) and bool (True, bool) results in: float64, able to write: True
Concat float (1.0, float64) and float (1.0, float64) results in: float64, able to write: True
Pandas: 1.5.3
anndata: 0.8.0
mudata: 0.2.2

I think this can be tracked down to this concat:

mudata/mudata/_core/mudata.py, lines 543 to 548 in da2de81:

    data_common = pd.concat(
        [getattr(a, attr)[columns_common] for m, a in self.mod.items()],
        join="outer",
        axis=0,
        sort=False,
    )
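To look at that call in isolation, here is a rough sketch that applies the same `pd.concat` arguments to two small tables. The frames `var_mod1` and `var_mod2` are hypothetical stand-ins for the per-modality `.var` tables, not objects taken from mudata. The resulting dtype of `test` depends on the pandas version, so the script only prints it along with the Python type of each element; in an object column the elements keep their original types (bool and float), which is consistent with the writer refusing to treat them as strings.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical stand-ins for m.mod['mod1'].var and m.mod['mod2'].var
var_mod1 = pd.DataFrame({"overlap": ["b", "d"], "test": [True, True]}, index=["q", "w"])
var_mod2 = pd.DataFrame({"overlap": ["f", "h"], "test": [np.nan, np.nan]}, index=["x", "y"])
columns_common = ["overlap", "test"]

with warnings.catch_warnings():
    # Some pandas versions warn about bool/numeric or all-NA concatenation
    warnings.simplefilter("ignore", FutureWarning)
    data_common = pd.concat(
        [df[columns_common] for df in (var_mod1, var_mod2)],
        join="outer",
        axis=0,
        sort=False,
    )

print(f"pandas {pd.__version__}")
print(data_common.dtypes)
# Element types inside the concatenated column
print([type(v).__name__ for v in data_common["test"]])
```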

Expected behaviour
I would not expect a change in behavior.

System

OS: macOS Ventura
Python version: 3.10.10
Versions of libraries involved: see examples above

Additional context
Could be related to scverse/anndata#679 but the issue being reported here is a behavior change so I would flag this as a separate bug (either way the discrepancy between str + nan and bool + nan should be resolved).

gtca · 2023-05-25T23:47:42Z

Hey @DriesSchaumont,

Thanks for noticing this change of behaviour with pandas 2.0 and providing a great example to test it.

I've started addressing it in #43, beginning with the boolean + nan combination that you highlighted.
So far I'm taking advantage of nullable boolean arrays.

In case you have any thoughts on what behaviour you would find most intuitive and/or how we can potentially generalise this decision making beyond just bool -> boolean conversion for nullable boolean arrays, I'd be interested to discuss it!

gtca · 2023-09-12T13:25:10Z

By the way, already with pandas 1.5.2 and mudata 0.2.3, float + bool is coerced to an object (same as bool + float).

And a short update is that mudata 0.3.0 will try to be more careful with using nullable boolean arrays to avoid potential issues like scverse/muon#111 (e.g. by using bool when there is no NA in the column in the end).
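For readers following along, the idea described in the two comments above (nullable "boolean" when a column has missing values, plain bool when it does not) could look roughly like the sketch below. `tidy_bool_column` is a hypothetical helper written for this illustration; it is not the code from #43 or from mudata 0.3.0.

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype


def tidy_bool_column(col: pd.Series) -> pd.Series:
    """Hypothetical helper (not mudata's implementation).

    If every non-missing value is a bool, return a nullable "boolean"
    column when NAs are present and a plain bool column otherwise.
    Columns with any other content are returned unchanged.
    """
    if infer_dtype(col, skipna=True) != "boolean":
        return col
    if col.isna().any():
        return col.astype("boolean")
    return col.astype(bool)


# An object column of the kind bool + nan produces on pandas 2.0
mixed = pd.Series([True, True, np.nan, np.nan], index=list("qwxy"), dtype=object)
# An object column holding only bools, no missing values
complete = pd.Series([True, False], index=list("qw"), dtype=object)

print(tidy_bool_column(mixed).dtype)     # boolean
print(tidy_bool_column(complete).dtype)  # bool
```

Falling back to plain bool when nothing is missing keeps the nullable dtype out of columns that do not need it, which is the precaution mentioned in connection with scverse/muon#111.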

DriesSchaumont added the "bug" (Something isn't working) label Apr 7, 2023

gtca added a commit that referenced this issue May 25, 2023

Address #40

bc7a066

Use "boolean" dtype instead of bool to deal with nullable bool arrays

gtca mentioned this issue May 25, 2023

Pandas 2.0 compatibility #43

Merged

gtca closed this as completed Sep 12, 2023
