API: setitem copy/view behavior ndarray vs Categorical vs other EA · Issue #38896 · pandas-dev/pandas
xref #33457, which is about a similar issue but goes through different code paths.
In Block.setitem, in cases where we are setting all the values for this block, we have:
    elif exact_match and is_categorical_dtype(arr_value.dtype):
        # GH25495 - If the current dtype is not categorical, we need to create a new categorical block
        values[indexer] = value
        return self.make_block(Categorical(self.values, dtype=arr_value.dtype))
    elif exact_match and is_ea_value:
        # GH#32395 if we're going to replace the values entirely, just substitute in the new array
        return self.make_block(arr_value)
    elif exact_match:
        # We are setting _all_ of the array's values, so can cast to new dtype
        values[indexer] = value
        values = values.astype(arr_value.dtype, copy=False)
So we overwrite the existing values when value is categorical or non-EA, but for any other EA value we substitute in the new array without writing to the existing one. Example:
import pandas as pd

df = pd.DataFrame({
    "A": [.1, .2, .3],
    "B": pd.array([1, 2, None], dtype="Int64"),
    "C": ["a", "b", "c"]
})
orig_df = df[:]
arr_np = df["A"]._values
arr_ea = df["B"]._values
cat = pd.Categorical(df["C"])
# Note: there are many equivalent-looking ways of doing this setitem operation but few of them go through this code path.
df.loc[range(3), "A"] = arr_np[::-1]
df.loc[range(3), "B"] = arr_ea[::-1]
df.loc[range(3), "C"] = cat[::-1]
>>> df
     A     B  C
0  0.3  <NA>  c
1  0.2     2  b
2  0.1     1  a
>>> df.dtypes
A     float64
B       Int64
C    category
dtype: object
>>> orig_df
     A     B  C
0  0.3     1  a
1  0.2     2  b
2  0.1  <NA>  c
>>> orig_df.dtypes
A    float64
B      Int64
C     object
dtype: object
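For anyone skimming: in the output above, orig_df["A"] picked up the reversed values while orig_df["B"] did not. That is what you would expect if one path writes into a buffer that orig_df shares and the other swaps in a new array. A minimal NumPy-only sketch of that distinction follows. It is not the pandas internals, and block_values and alias are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical stand-ins: `block_values` plays the role of a block's
# underlying array, `alias` the role of another object sharing that data.
block_values = np.array([0.1, 0.2, 0.3])
alias = block_values[:]                  # a view, no data is copied
new_values = np.array([0.3, 0.2, 0.1])

# In-place write: mutates the shared buffer, so the alias sees the change.
block_values[:] = new_values
in_place_visible = alias.tolist() == [0.3, 0.2, 0.1]

# Substitution: rebind the name to a different array; the alias is untouched.
block_values = np.array([9.0, 9.0, 9.0])
substitution_visible = alias.tolist() == [9.0, 9.0, 9.0]

print(in_place_visible, substitution_visible)  # True False
```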
The categorical behavior was implemented in #23393, and AFAICT the overwriting behavior was not discussed or intentional. Similarly, the other-EA behavior was implemented in #32479, and I don't see any discussion there of whether to overwrite. I haven't tracked down the origin of the non-EA behavior.
I think all three cases should have the same behavior. We should also have the same behavior for should-be-equivalent setters, e.g. if we used iloc instead of loc, or [:, "A"] instead of [range(3), "A"].
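One way to compare the should-be-equivalent setters is to check whether the column still lives in the same memory afterwards. A rough sketch, assuming np.shares_memory on the result of Series.to_numpy() is a fair proxy for "in place" (to_numpy is not guaranteed to return a view). writes_in_place and the set_* functions are hypothetical helpers, and the result will depend on the pandas version and on copy-on-write, so no expected output is shown.

```python
import numpy as np
import pandas as pd


def writes_in_place(setter):
    """Return True if `setter` left column "A" in the same memory.

    `setter` is a hypothetical callback taking (df, new_values).
    """
    df = pd.DataFrame({"A": [0.1, 0.2, 0.3]})
    before = df["A"].to_numpy()        # usually a view on df's data
    new_values = before[::-1].copy()   # independent, reversed copy
    setter(df, new_values)
    after = df["A"].to_numpy()
    return bool(np.shares_memory(before, after))


def set_loc_range(df, new_values):
    df.loc[range(3), "A"] = new_values


def set_loc_slice(df, new_values):
    df.loc[:, "A"] = new_values


def set_iloc_slice(df, new_values):
    df.iloc[:, 0] = new_values


for setter in (set_loc_range, set_loc_slice, set_iloc_slice):
    # Report only; which setters are in place varies across versions.
    print(setter.__name__, writes_in_place(setter))
```

If all three lines print the same value, the setters are at least consistent with each other for the float column; the same helper could be adapted for the Int64 and categorical cases.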
I think I agree with @TomAugspurger's comment that these should always be in-place, but not sure ATM if that can be done without breaking consistency elsewhere.