API: common dtype for bool + numeric: upcast to object or coerce to numeric? · Issue #39817 · pandas-dev/pandas
We currently have an inconsistency in how we determine the common dtype for bool + numeric.
NumPy coerces booleans to numeric values when combining them with a numeric dtype:
>>> np.concatenate([np.array([True]), np.array([1])])
array([1, 1])
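The promotion rule behind this result can also be queried directly, without building arrays. The following is a minimal sketch using `np.result_type`; the explicit `int64` / `float64` dtypes are chosen here only to keep the output platform-independent and are not part of the original example.

```python
import numpy as np

# NumPy's promotion rules place bool below every numeric dtype,
# so bool combined with a numeric dtype yields that numeric dtype.
bool_int = np.result_type(np.bool_, np.int64)
bool_float = np.result_type(np.bool_, np.float64)
print(bool_int, bool_float)

# Concatenation applies the same rule and converts True to 1.0.
combined = np.concatenate([np.array([True]), np.array([1.0])])
print(combined.dtype, combined.tolist())
```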
In pandas, Series does the same:
>>> pd.concat([pd.Series([True], dtype=bool), pd.Series([1.0], dtype=float)])
0    1.0
0    1.0
dtype: float64
The exception is when the Series are empty, in which case we ensure the result is object dtype:
>>> pd.concat([pd.Series([], dtype=bool), pd.Series([], dtype=float)])
Series([], dtype: object)
And for DataFrame we return object dtype in all cases:
>>> pd.concat([pd.DataFrame({'a': np.array([], dtype=bool)}), pd.DataFrame({'a': np.array([], dtype=float)})]).dtypes
a    object
dtype: object
>>> pd.concat([pd.DataFrame({'a': np.array([True], dtype=bool)}), pd.DataFrame({'a': np.array([1.0], dtype=float)})]).dtypes
a    object
dtype: object
For the nullable dtypes, we have also implemented this somewhat inconsistently so far:
>>> pd.concat([pd.Series([True], dtype="boolean"), pd.Series([1], dtype="Int64")])
0    1
0    1
dtype: Int64
>>> pd.concat([pd.Series([True], dtype="boolean"), pd.Series([1], dtype="Float64")])
0    True
0     1.0
dtype: object
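Since these results depend on the pandas version, it can help to tabulate them in one go. The script below is only a sketch for reproducing the comparison: `concat_result_dtype` is a hypothetical helper written for this purpose, not a pandas function, and it assumes a pandas version that provides the nullable `"Float64"` dtype. No expected output is shown because it differs between releases.

```python
import pandas as pd


def concat_result_dtype(left_dtype, right_dtype, as_frame=False, empty=False):
    """Return the dtype pd.concat produces for a bool-like and a numeric input."""
    left = pd.Series([] if empty else [True], dtype=left_dtype)
    right = pd.Series([] if empty else [1], dtype=right_dtype)
    if as_frame:
        # Wrap each Series in a single-column DataFrame named "a".
        result = pd.concat([left.to_frame("a"), right.to_frame("a")])
        return result["a"].dtype
    return pd.concat([left, right]).dtype


# Pair numpy bool with numpy numerics, and nullable boolean with nullable numerics.
pairs = [
    ("bool", "int64"),
    ("bool", "float64"),
    ("boolean", "Int64"),
    ("boolean", "Float64"),
]

for left_dtype, right_dtype in pairs:
    for as_frame in (False, True):
        for empty in (False, True):
            container = "DataFrame" if as_frame else "Series"
            dtype = concat_result_dtype(left_dtype, right_dtype, as_frame, empty)
            print(f"{left_dtype} + {right_dtype} ({container}, empty={empty}): {dtype}")
```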
So here we preserve the numeric dtype for Integer, but convert to object for Float. The reason is that IntegerDtype._get_common_dtype handles the boolean dtype case and then uses the NumPy rules to determine the result dtype, while FloatingDtype doesn't yet handle non-float dtypes and thus results in object dtype for anything else (including the float NumPy dtype, which is obviously a bug / missing feature).
Basically we need to decide what the desired behaviour is for the bool + numeric dtype combination: coerce to numeric or upcast to object? (and then fix the inconsistencies according to the decided rule)
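To make the two options concrete, here is a minimal sketch of both rules for plain NumPy dtypes. The function `common_dtype` and its `bool_numeric` parameter are hypothetical names introduced for illustration; this is not how pandas implements common-dtype resolution, and it ignores extension dtypes entirely.

```python
import numpy as np


def common_dtype(dtypes, bool_numeric="coerce"):
    """Hypothetical helper: resolve a common dtype under one of the two rules.

    bool_numeric="coerce": follow NumPy promotion (bool + numeric -> numeric).
    bool_numeric="object": upcast to object whenever bool meets a numeric dtype.
    """
    if bool_numeric not in ("coerce", "object"):
        raise ValueError("bool_numeric must be 'coerce' or 'object'")
    dtypes = [np.dtype(d) for d in dtypes]
    has_bool = any(d.kind == "b" for d in dtypes)
    has_numeric = any(d.kind in "iufc" for d in dtypes)
    if bool_numeric == "object" and has_bool and has_numeric:
        return np.dtype(object)
    # Everything else falls back to NumPy's own promotion rules.
    return np.result_type(*dtypes)


print(common_dtype(["bool", "float64"], bool_numeric="coerce"))
print(common_dtype(["bool", "float64"], bool_numeric="object"))
```

Whichever rule is chosen, the point of the sketch is that it would apply uniformly, regardless of whether the inputs are empty or whether they live in a Series or a DataFrame.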