BUG/API (string dtype): return float dtype for series[str].rank() by jorisvandenbossche · Pull Request #59768 · pandas-dev/pandas
Noticed in #59758 (comment)
This partially fixes a bug: currently, in cases where the result actually needs to be float but we try to convert it back to an int, we get an error:
In [58]: pd.Series([2, 1, 1]).rank()
Out[58]:
0    3.0
1    1.5
2    1.5
dtype: float64

In [59]: pd.Series(["2", "1", "1"], dtype="string[pyarrow]").rank()
...
File ~/scipy/repos/pandas/pandas/core/arrays/string_arrow.py:445, in _convert_int_result(self, result)

File ~/scipy/repos/pandas/pandas/core/arrays/numeric.py:93, in NumericDtype.from_arrow(self, array)
---> 93 array = array.cast(pyarrow_type)
...
ArrowInvalid: Float value 1.5 was truncated converting to int64
But in general we should also decide what to return here. For our default dtypes, it seems we decided in the past to simply always return float64, even for rank methods that could return ints.
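To illustrate the existing behaviour for default dtypes, here is a small sketch that ranks the same NumPy-backed integer Series with each tie-breaking method. The `results` dict is only an illustrative name; the point is that the dtype is float64 in every case, including methods whose ranks are always whole numbers.

```python
import pandas as pd

s = pd.Series([2, 1, 1])

# Rank with every tie-breaking method. For a default (NumPy-backed) dtype
# the result is float64 each time, even when all ranks are whole numbers.
results = {m: s.rank(method=m) for m in ["average", "min", "max", "first", "dense"]}

for method, ranked in results.items():
    print(f"{method:>7}: {ranked.tolist()} ({ranked.dtype})")
```

Only `method="average"` can produce fractional ranks here (1.5 for the tied values); the other methods return whole numbers stored as floats.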
For the ArrowDtype, then, it was decided to keep the dtype returned by pyarrow: #50264
For now, this PR updates StringDtype to simply always return float64, to be consistent between the pyarrow and python storage. But we could also consider, for those newer dtypes, actually keeping the distinction between float and int results (just not always int, as is done now).
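As a rough sketch of what "keeping the distinction" could look like from the user side, the helper below picks the result dtype from the rank method rather than from the values. This is not what the PR implements: `rank_keep_int` and `INTEGER_METHODS` are hypothetical names, and the choice of nullable `Int64` for the integer case is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical: methods whose ranks are always whole numbers (with pct=False).
INTEGER_METHODS = {"min", "max", "first", "dense"}


def rank_keep_int(values: pd.Series, method: str = "average") -> pd.Series:
    """Rank `values`, returning an integer dtype only when the method allows it."""
    ranked = values.rank(method=method)  # float64 for default dtypes
    if method in INTEGER_METHODS:
        # Nullable Int64 keeps missing ranks as <NA> instead of failing on NaN.
        return ranked.astype("Int64")
    # "average" can yield fractional ranks such as 1.5, so stay float.
    return ranked


s = pd.Series([2, 1, 1])
print(rank_keep_int(s, "min").dtype)      # integer result
print(rank_keep_int(s, "average").dtype)  # float result
```

Deciding by method rather than by inspecting the values keeps the result dtype predictable: the same call returns the same dtype regardless of whether the data happens to contain ties.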