The above is in a groupby context, but there is of course also the normal column / Series case for empty or all-NA data. The current behaviour for our StringDtype variants:
So for the default pyarrow backend, it does give the empty string. Now, I don't remember explicitly implementing or testing that behaviour in #59853, so I assume this is just "by accident" because of how it is implemented.
(checking the implementation, this is indeed the case because it uses the join algorithm pc.binary_join(data_list, "") (where data_list is the data with missing values filtered out), which is essentially the equivalent for the python logic "".join(data), so if there is nothing to join, you get the join character, being an empty string here)
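To illustrate the join-based logic described above, here is a minimal pure-Python sketch. It is not the pandas or pyarrow implementation; the helper names is_missing and string_sum are hypothetical, and treating None and float NaN as the missing values is an assumption made only for this example.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def string_sum(values):
    """Concatenate the non-missing strings, mirroring "".join(data)."""
    data = [v for v in values if not is_missing(v)]
    # With nothing left to join, "".join returns the separator-free
    # empty result, i.e. the empty string.
    return "".join(data)


print(repr(string_sum(["a", None, "b"])))      # 'ab'
print(repr(string_sum([None, float("nan")])))  # ''
print(repr(string_sum([])))                    # ''
```

Both the empty input and the all-missing input reduce to joining an empty list, which is why the empty string falls out without any special casing.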
I think the pyarrow-backed behavior here makes sense to use generically
I agree with Will. If doing a sum on strings, the default value is the empty string when all values are missing. Just like the default value is 0 when the dtype is int or float.
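The analogy with numeric dtypes can be sketched as a reduction that starts from the identity element of the operation. This is only an illustration in plain Python, not how pandas implements it; sum_with_identity is a hypothetical helper, and None stands in for a missing value here.

```python
from functools import reduce
import operator


def sum_with_identity(values, identity):
    """Add up the non-missing values, starting from the given identity."""
    non_missing = [v for v in values if v is not None]
    # When every value is missing, reduce returns the identity unchanged.
    return reduce(operator.add, non_missing, identity)


print(sum_with_identity([None, None], 0))             # 0
print(sum_with_identity([None, None], 0.0))           # 0.0
print(repr(sum_with_identity([None, None], "")))      # ''
print(repr(sum_with_identity(["a", None, "b"], "")))  # 'ab'
```

Under this view, 0 is the natural all-missing result for int or float, and the empty string plays the same role for strings.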
We decided to allow the sum operation for the future string dtype (PR in #59853, based on discussion in #59328). But I ran into a strange case in groupby where the end result contains "0" in case of an empty or all-NaN group.
Reproducible example:
Currently, you get this:
So the "sum" operation has introduced a 0. Not very useful I think in context of strings, but at least it is object dtype and can contain anything.
However, with pd.options.future.infer_string = True enabled and starting from a proper string dtype, the result is seemingly the same (the repr looks the same), but the values in the column are now strings:
So the integer 0 has been converted to a string.
I think we certainly should not introduce this "0" string, and returning object dtype with 0 is also quite useless I think (but at least not inventing a new string in your data).
But if we have to return something, the empty string is probably the equivalent of 0 in case of string or the "sum" operation?
cc @rhshadrach @WillAyd @Dr-Irv