Skip to content

BUG/API: sum of a string column with all-NaN or empty #60229

@jorisvandenbossche

Description

@jorisvandenbossche

We decided to allow the sum operation for the future string dtype (PR in #59853, based on discussion in #59328).

But I ran into a strange case in groupby where the end result contains "0" in case of an empty or all-NaN group.

Reproducible example:

df = pd.DataFrame( { "key": [1, 2, 2, 3, 3, 3], "col1": [np.nan, 2, np.nan, 4, 5, 6], "col2": [np.nan, "b", np.nan, "d", "e", "f"], } ) result = df.groupby("key").sum()

Currently, you get this:

>>> result col1 col2 key 1 0.0 0 2 2.0 b 3 15.0 def >>> result["col2"].values array([0, 'b', 'def'], dtype=object)

So the "sum" operation has introduced a 0. Not very useful I think in context of strings, but at least it is object dtype and can contain anything.

However, with pd.options.future.infer_string = True enabled and starting from a proper string dtype, the result is seemingly the same (the repr looks the same), but the values in the column are now strings:

>>> result["col2"].values <ArrowStringArrayNumpySemantics> ['0', 'b', 'def'] Length: 3, dtype: str

So the integer 0 has been converted to a string.

I think we certainly should not introduce this "0" string, and returning object dtype with 0 is also quite useless I think (but at least not inventing a new string in your data).
But if we have to return something, the empty string is probably the equivalent of 0 in case of string or the "sum" operation?

cc @rhshadrach @WillAyd @Dr-Irv

Metadata

Metadata

Assignees

Labels

BugGroupbyStringsString extension data type and string data

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions