API: common dtype for bool + numeric: upcast to object or coerce to numeric? · Issue #39817 · pandas-dev/pandas

We currently have an inconsistency in how we determine the common dtype for bool + numeric.

NumPy coerces booleans to numeric values when combining them with a numeric dtype:

    >>> np.concatenate([np.array([True]), np.array([1])])
    array([1, 1])
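The same promotion can be inspected without building arrays by hand. This is a minimal sketch using `np.result_type`; the variable names are illustrative only and not part of the issue:

```python
import numpy as np

# Ask NumPy which dtype it would pick when bool meets a numeric dtype.
bool_int = np.result_type(np.bool_, np.int64)
bool_float = np.result_type(np.bool_, np.float64)

# Concatenation follows the same rule: True/False become 1.0/0.0.
combined = np.concatenate([np.array([True, False]), np.array([1.5])])

print(bool_int, bool_float, combined.dtype)
```

In both cases the boolean side is absorbed into the numeric dtype rather than producing object.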

In pandas, Series does the same:

    >>> pd.concat([pd.Series([True], dtype=bool), pd.Series([1.0], dtype=float)])
    0    1.0
    0    1.0
    dtype: float64

except when they are empty, in which case we ensure the result is object dtype:

    >>> pd.concat([pd.Series([], dtype=bool), pd.Series([], dtype=float)])
    Series([], dtype: object)

And for DataFrame we return object dtype in all cases:

    >>> pd.concat([pd.DataFrame({'a': np.array([], dtype=bool)}), pd.DataFrame({'a': np.array([], dtype=float)})]).dtypes
    a    object
    dtype: object
    >>> pd.concat([pd.DataFrame({'a': np.array([True], dtype=bool)}), pd.DataFrame({'a': np.array([1.0], dtype=float)})]).dtypes
    a    object
    dtype: object

For the nullable dtypes, we have also implemented this somewhat inconsistently so far:

    >>> pd.concat([pd.Series([True], dtype="boolean"), pd.Series([1], dtype="Int64")])
    0    1
    0    1
    dtype: Int64

    >>> pd.concat([pd.Series([True], dtype="boolean"), pd.Series([1], dtype="Float64")])
    0    True
    0     1.0
    dtype: object

So here we preserve the numeric dtype for Integer, but convert to object for Float. The reason is that IntegerDtype._get_common_dtype handles the boolean dtype and then uses the NumPy rules to determine the result dtype, while FloatingDtype doesn't yet handle non-float dtypes and thus results in object dtype for anything else (including the float NumPy dtype, which is obviously a bug / missing feature).
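To make the asymmetry concrete, here is a rough sketch of the two code paths described above. It is not the pandas implementation: `NULLABLE_TO_NUMPY`, `integer_style_common` and `floating_style_common` are hypothetical names, dtypes are modelled as plain strings, and returning `None` stands in for "no common dtype found, so the caller falls back to object".

```python
import numpy as np

# Hypothetical mapping from nullable dtype names to NumPy equivalents.
NULLABLE_TO_NUMPY = {
    "boolean": np.bool_,
    "Int64": np.int64,
    "Float64": np.float64,
}


def integer_style_common(names):
    """Accept boolean and numeric inputs and defer to NumPy's promotion rules."""
    return np.result_type(*(NULLABLE_TO_NUMPY[name] for name in names))


def floating_style_common(names):
    """Only accept Float inputs; anything else yields no common dtype."""
    if all(name.startswith("Float") for name in names):
        return np.result_type(*(NULLABLE_TO_NUMPY[name] for name in names))
    return None  # the caller would fall back to object dtype


print(integer_style_common(["boolean", "Int64"]))     # numeric result
print(floating_style_common(["boolean", "Float64"]))  # None, i.e. object
```

Under these assumptions the Integer-style path keeps a numeric result for boolean input, while the Float-style path gives up as soon as it sees anything that is not a Float.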

Basically, we need to decide what the desired behaviour is for the bool + numeric dtype combination: coerce to numeric or upcast to object? We can then fix the inconsistencies according to the chosen rule.
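The two candidate rules can be written side by side as a small sketch. Both function names are hypothetical and operate on NumPy dtypes only; neither is an existing pandas API:

```python
import numpy as np


def common_dtype_coerce(dtypes):
    """Option 1: follow NumPy and let bool be absorbed by the numeric dtype."""
    return np.result_type(*dtypes)


def common_dtype_upcast(dtypes):
    """Option 2: mixing bool with int/uint/float gives object."""
    kinds = {np.dtype(d).kind for d in dtypes}
    if "b" in kinds and kinds & set("iuf"):
        return np.dtype(object)
    # Without a bool/numeric mix, the usual NumPy promotion applies.
    return np.result_type(*dtypes)


mixed = [np.bool_, np.float64]
print(common_dtype_coerce(mixed), common_dtype_upcast(mixed))
```

Whichever rule is chosen, applying it to Series, DataFrame, empty inputs and nullable dtypes alike would remove the inconsistencies listed above.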