BUG/API: sum of a string column with all-NaN or empty #60229

Open
Tracked by #54792
jorisvandenbossche opened this issue Nov 7, 2024 · 4 comments

@jorisvandenbossche
Member

jorisvandenbossche commented Nov 7, 2024

We decided to allow the sum operation for the future string dtype (PR in #59853, based on discussion in #59328).

But I ran into a strange case in groupby where the end result contains "0" in case of an empty or all-NaN group.

Reproducible example:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "key": [1, 2, 2, 3, 3, 3],
        "col1": [np.nan, 2, np.nan, 4, 5, 6],
        "col2": [np.nan, "b", np.nan, "d", "e", "f"],
    }
)
result = df.groupby("key").sum()

Currently, you get this:

>>> result
     col1 col2
key           
1     0.0    0
2     2.0    b
3    15.0  def

>>> result["col2"].values
array([0, 'b', 'def'], dtype=object)

So the "sum" operation has introduced a 0. I don't think that is very useful in the context of strings, but at least the result is object dtype, which can contain anything.

However, with pd.options.future.infer_string = True enabled and starting from a proper string dtype, the result is seemingly the same (the repr looks the same), but the values in the column are now strings:

>>> result["col2"].values
<ArrowStringArrayNumpySemantics>
['0', 'b', 'def']
Length: 3, dtype: str

So the integer 0 has been converted to a string.

I think we certainly should not introduce this "0" string. Returning object dtype with 0 is also not very useful, I think, although at least it does not invent a new string in your data.
But if we have to return something, the empty string is probably the equivalent of 0 for strings under the "sum" operation?
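To make the proposal concrete, here is a minimal pure-Python sketch of the suggested semantics for the example above. It is not pandas code: `is_missing` and `grouped_string_sum` are hypothetical helpers, and treating `None` and float NaN as missing is an assumption of the sketch.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def grouped_string_sum(keys, values):
    """Concatenate the non-missing strings per key, starting each group at ""."""
    totals = {}
    for key, value in zip(keys, values):
        # Register the group even if all of its values turn out to be missing.
        totals.setdefault(key, "")
        if not is_missing(value):
            totals[key] += value
    return totals


keys = [1, 2, 2, 3, 3, 3]
col2 = [float("nan"), "b", float("nan"), "d", "e", "f"]
result = grouped_string_sum(keys, col2)
```

Under these assumptions the all-NaN group for key 1 ends up as an empty string rather than 0 or "0", and every value in the result is a string.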

cc @rhshadrach @WillAyd @Dr-Irv

@jorisvandenbossche jorisvandenbossche added Bug Groupby Strings String extension data type and string data labels Nov 7, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Nov 7, 2024
@jorisvandenbossche
Member Author

The above is in a groupby context, but there is of course also the normal column / Series case for empty or all-NA data. The current behaviour for our StringDtype variants:

>>> pd.Series([np.nan], dtype=pd.StringDtype("pyarrow", na_value=np.nan)).sum()
''

>>> pd.Series([np.nan], dtype=pd.StringDtype("python", na_value=np.nan)).sum()
0

So for the default pyarrow backend, it does give the empty string. Now, I don't remember explicitly implementing or testing that behaviour in #59853, so I assume this is just "by accident" because of how it is implemented.
(Checking the implementation, this is indeed the case: it uses the join algorithm pc.binary_join(data_list, "") (where data_list is the data with missing values filtered out), which is essentially the equivalent of the Python logic "".join(data). So if there is nothing to join, you get an empty string.)
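The Python side of that analogy can be sketched as follows. This only illustrates `str.join` on filtered data; it is not the pyarrow implementation. `string_sum` and its `separator` parameter are hypothetical names, and only float NaN is treated as missing here.

```python
import math


def string_sum(values, separator=""):
    """Join the non-missing values; float NaN counts as missing in this sketch."""
    present = [v for v in values if not (isinstance(v, float) and math.isnan(v))]
    return separator.join(present)


all_missing = string_sum([float("nan")])       # nothing left to join
mixed = string_sum(["d", float("nan"), "f"])   # NaN is skipped
empty_with_separator = string_sum([], separator=", ")
```

In Python, joining an empty sequence returns an empty string whatever the separator is, so the all-missing case falls out of the join without any special handling.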

@WillAyd
Member

WillAyd commented Nov 7, 2024

I think the pyarrow-backed behavior here makes sense to use generically

@Dr-Irv
Contributor

Dr-Irv commented Nov 7, 2024

> I think the pyarrow-backed behavior here makes sense to use generically

I agree with Will. When doing a sum on strings, the default value is the empty string when all values are missing, just as the default value is 0 when the dtype is int or float.
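The analogy can be illustrated in plain Python by reducing with an explicit starting value. This is a sketch of the idea only, not of how pandas computes sums, and `sum_with_identity` is a hypothetical helper.

```python
import operator
from functools import reduce


def sum_with_identity(values, identity):
    """Fold values with +, starting from the identity element for that type."""
    return reduce(operator.add, values, identity)


numeric_empty = sum_with_identity([], 0)    # nothing to add: stays at 0
string_empty = sum_with_identity([], "")    # nothing to concatenate: stays at ""
string_total = sum_with_identity(["d", "e", "f"], "")
```

With 0 as the starting value for numbers and "" for strings, an empty or all-missing input returns the starting value unchanged in both cases.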

@Thanushri16

take
