DISCUSS/API: setitem-like operations should only update inplace and never fallback with upcast (i.e never change the dtype) · Issue #39584 · pandas-dev/pandas
Currently, setitem-like operations (i.e. operations that change values in an existing Series or DataFrame, such as `__setitem__` and `.loc`/`.iloc` setitem, or filling methods like `fillna`) first try to update in place, but if there is a dtype mismatch, pandas will upcast to a common dtype (typically object dtype).
For example, setting a string into an integer Series upcasts to object:
    s = pd.Series([1, 2, 3])
    s.loc[1] = "B"
    s
    0    1
    1    B
    2    3
    dtype: object
or doing a `fillna` with an invalid fill value also upcasts instead of raising an error:
    s = pd.Series(["2020-01-01", "NaT"], dtype="datetime64[ns]")
    s
    0   2020-01-01
    1          NaT
    dtype: datetime64[ns]
    s.fillna(1)
    0    2020-01-01 00:00:00
    1                      1
    dtype: object
My general proposal would be that at some point in the future (e.g. pandas 2.0, after a deprecation), such inherently in-place operations should be guaranteed to either happen in place or raise an error, and thus never change the dtype of the original Series/DataFrame.
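To make the proposed guarantee concrete, here is a minimal sketch of the "in place or error" semantics. The helper `set_strict` is hypothetical (it is not a pandas API), and it only emulates the idea by trying the assignment on a copy first; an actual implementation would live inside pandas' setitem machinery.

```python
import warnings

import pandas as pd


def set_strict(s, key, value):
    """Hypothetical helper (not a pandas API): do s.loc[key] = value,
    raising instead of letting the dtype change."""
    trial = s.copy()
    try:
        with warnings.catch_warnings():
            # Some pandas versions warn before upcasting; silence that here.
            warnings.simplefilter("ignore")
            trial.loc[key] = value
    except (TypeError, ValueError) as exc:
        # Some pandas versions may already refuse the assignment.
        raise TypeError(f"cannot set {value!r} into dtype {s.dtype}") from exc
    if trial.dtype != s.dtype:
        raise TypeError(f"cannot set {value!r} into dtype {s.dtype}")
    # The trial kept its dtype, so the real assignment will not upcast.
    s.loc[key] = value


s = pd.Series([1, 2, 3])
set_strict(s, 1, 20)       # compatible value: s is updated
try:
    set_strict(s, 1, "B")  # incompatible value: raises, s is untouched
except TypeError as err:
    message = str(err)
```

Trying the assignment on a copy is wasteful and is only meant to illustrate the behaviour under discussion, not how it would be implemented.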
This is similar to e.g. numpy's behaviour, where setitem never changes the dtype. Showing the first example from above in equivalent numpy code:
    arr = np.array([1, 2, 3])
    arr[1] = "B"
    ...
    ValueError: invalid literal for int() with base 10: 'B'
Apart from that, I also think this is the cleaner behaviour, with fewer surprises. If a user specifically wants to allow mixed types in a column, they can manually cast to `object` dtype first.
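As a short illustration of that explicit opt-in (the variable name `mixed` is only for this example), casting to object up front means the later assignment no longer involves a dtype change:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Opt in to mixed types explicitly; astype returns a new Series.
mixed = s.astype(object)
mixed.loc[1] = "B"  # no upcast needed: the Series is already object dtype
```

The original `s` keeps its integer dtype, so the choice to allow mixed values is visible in the code instead of happening as a side effect of the assignment.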
On the other hand, this is quite a big change from how permissive we generally are right now, where we easily upcast, and such a change will certainly impact quite some user code. That said, AFAIK it is perfectly possible to do this with proper deprecation warnings, raised in advance for the specific cases that will error in the future.
There are certainly some more details that need to be discussed as well if we want this (which exact values are regarded as compatible with the dtype, e.g. when setting a float in an integer column, should that error or silently round the float?). But what are people's thoughts on the general idea?
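For reference on the float-into-integer question, numpy takes the silent route and drops the fractional part on assignment. The sketch below shows that, together with one possible (purely hypothetical) compatibility rule, `is_lossless_int`, which would accept only floats that can be stored without losing information:

```python
import numpy as np

arr = np.array([1, 2, 3])
arr[1] = 2.7               # numpy casts silently, dropping the fractional part
truncated = int(arr[1])    # 2


def is_lossless_int(value):
    """Hypothetical rule: a float is compatible with an integer dtype
    only if it has no fractional part."""
    return float(value).is_integer()
```

Under such a rule, setting `2.0` into an integer column would be allowed while `2.7` would raise; whether that is the right cut-off is exactly the kind of detail that still needs discussion.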
cc @pandas-dev/pandas-core