BUG/API: sum of a string column with all-NaN or empty #60229

jorisvandenbossche · 2024-11-07T15:55:19Z

We decided to allow the sum operation for the future string dtype (PR in #59853, based on discussion in #59328).

But I ran into a strange case in groupby where the end result contains "0" in case of an empty or all-NaN group.

Reproducible example:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "key": [1, 2, 2, 3, 3, 3],
        "col1": [np.nan, 2, np.nan, 4, 5, 6],
        "col2": [np.nan, "b", np.nan, "d", "e", "f"],
    }
)
result = df.groupby("key").sum()

Currently, you get this:

>>> result
     col1 col2
key           
1     0.0    0
2     2.0    b
3    15.0  def

>>> result["col2"].values
array([0, 'b', 'def'], dtype=object)

So the "sum" operation has introduced a 0. That is not very useful in the context of strings, I think, but at least the result is object dtype and can contain anything.

However, with pd.options.future.infer_string = True enabled and starting from a proper string dtype, the result is seemingly the same (the repr looks the same), but the values in the column are now strings:

>>> result["col2"].values
<ArrowStringArrayNumpySemantics>
['0', 'b', 'def']
Length: 3, dtype: str

So the integer 0 has been converted to a string.

I think we certainly should not introduce this "0" string, and returning object dtype with 0 is also quite useless (though at least it does not invent a new string in your data).
But if we have to return something, the empty string is probably the equivalent of 0 for strings under the "sum" operation?
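For illustration, here is a minimal sketch of what an empty-string result would look like for the example above. It does not use the built-in `sum`; instead it passes a hypothetical helper, `concat_non_missing` (a name made up for this sketch), to `agg`. This is not how pandas implements the reduction, only a way to see the proposed output.

```python
import numpy as np
import pandas as pd


def concat_non_missing(values: pd.Series) -> str:
    # Hypothetical helper: drop missing entries, then concatenate the rest.
    # An empty or all-NaN group yields the empty string.
    return "".join(values.dropna())


df = pd.DataFrame(
    {
        "key": [1, 2, 2, 3, 3, 3],
        "col2": [np.nan, "b", np.nan, "d", "e", "f"],
    }
)

# Aggregate the string column per group with the helper instead of sum().
result = df.groupby("key")["col2"].agg(concat_non_missing)
print(result.tolist())
```

With this helper, the all-NaN group for key 1 ends up as an empty string rather than 0 or "0".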

cc @rhshadrach @WillAyd @Dr-Irv


jorisvandenbossche · 2024-11-07T16:08:09Z

The above is in a groupby context, but there is of course also the normal column / Series case for empty or all-NA data. The current behaviour for our StringDtype variants:

>>> pd.Series([np.nan], dtype=pd.StringDtype("pyarrow", na_value=np.nan)).sum()
''

>>> pd.Series([np.nan], dtype=pd.StringDtype("python", na_value=np.nan)).sum()
0

So for the default pyarrow backend, it does give the empty string. Now, I don't remember explicitly implementing or testing that behaviour in #59853, so I assume this is just "by accident" because of how it is implemented.
(Checking the implementation, this is indeed the case: it uses the join algorithm pc.binary_join(data_list, "") (where data_list is the data with missing values filtered out), which is essentially the equivalent of the Python logic "".join(data), so if there is nothing to join, you get an empty string.)
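The pure-Python analogue of that join logic can be sketched as follows. The helper name `join_non_missing` is hypothetical, and the sketch uses only `str.join`, not pyarrow, so it shows the idea rather than the actual pandas code path.

```python
import math


def join_non_missing(values):
    """Concatenate the string entries of `values`, skipping missing ones."""
    # Keep only real strings; NaN and None are treated as missing here.
    data_list = [v for v in values if isinstance(v, str)]
    return "".join(data_list)


all_missing = join_non_missing([math.nan])
mixed = join_non_missing(["d", math.nan, "e", "f"])

# Joining an empty list gives "" whatever the separator is.
comma_empty = ",".join([])

print(repr(all_missing), repr(mixed), repr(comma_empty))
```

Joining zero elements returns the empty string regardless of the separator, which is why the all-missing case falls out as '' without any special handling.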

WillAyd · 2024-11-07T18:35:46Z

I think the pyarrow-backed behavior here makes sense to use generically

Dr-Irv · 2024-11-07T19:09:57Z

> I think the pyarrow-backed behavior here makes sense to use generically

I agree with Will. If doing a sum on strings, the default value is the empty string when all values are missing. Just like the default value is 0 when the dtype is int or float.
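For comparison, a short sketch of the numeric behaviour referred to here: by default, the sum of an all-NaN or empty float Series is 0.0, and `min_count=1` can be used to ask for a missing value instead. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

all_nan = pd.Series([np.nan, np.nan])
empty = pd.Series([], dtype="float64")

# Default: missing values are skipped and the sum of nothing is 0.0.
all_nan_total = all_nan.sum()
empty_total = empty.sum()

# With min_count=1, at least one valid value is required,
# otherwise the result is NaN.
strict_total = all_nan.sum(min_count=1)

print(all_nan_total, empty_total, strict_total)
```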

Thanushri16 · 2024-11-12T02:19:21Z

take

jorisvandenbossche added the Bug, Groupby, and Strings (String extension data type and string data) labels Nov 7, 2024

jorisvandenbossche added this to the 2.3 milestone Nov 7, 2024

jorisvandenbossche mentioned this issue Nov 7, 2024

lib/datautils: groupby_agg on string column with missing values introduces 0 owid/etl#3515

Open

jorisvandenbossche mentioned this issue Nov 7, 2024

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

Open

41 tasks

github-actions bot assigned Thanushri16 Nov 12, 2024
