API: setitem copy/view behavior ndarray vs Categorical vs other EA · Issue #38896 · pandas-dev/pandas

xref #33457, which is about a similar issue but goes through different code paths.

In Block.setitem, in the cases where we are setting all of the values for this block, we have:

        elif exact_match and is_categorical_dtype(arr_value.dtype):
            # GH25495 - If the current dtype is not categorical, we need to create a new categorical block
            values[indexer] = value
            return self.make_block(Categorical(self.values, dtype=arr_value.dtype))

        elif exact_match and is_ea_value:
            # GH#32395 if we're going to replace the values entirely, just substitute in the new array
            return self.make_block(arr_value)

        elif exact_match:
            # We are setting _all_ of the array's values, so can cast to new dtype
            values[indexer] = value

            values = values.astype(arr_value.dtype, copy=False)
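The last branch can be illustrated with plain NumPy. This is a minimal sketch, not the pandas code itself: `values`, `view`, and `new` are stand-in names, and `view` plays the role of another object sharing the block's memory. It shows that the indexed assignment writes into the existing buffer, and that `astype(..., copy=False)` does not allocate when the dtype already matches.

```python
import numpy as np

values = np.array([0.1, 0.2, 0.3])
view = values[:]               # shares memory with `values`
new = values[::-1].copy()      # reversed values, same dtype

values[slice(None)] = new      # in-place write, like `values[indexer] = value`
same_dtype = values.astype(new.dtype, copy=False)

# The dtype already matches, so no new buffer is allocated and the view sees the write.
print(np.shares_memory(view, same_dtype), view.tolist())  # True [0.3, 0.2, 0.1]

# When the dtype differs, astype has to allocate even with copy=False.
values2 = np.array([1.0, 2.0, 3.0])
cast = values2.astype(np.int64, copy=False)
print(np.shares_memory(values2, cast))  # False
```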

So we overwrite the existing values when the new value is a Categorical or a non-EA array, but not when it is any other EA (there the new array is substituted in). Example:

import pandas as pd

df = pd.DataFrame({
    "A": [.1, .2, .3],
    "B": pd.array([1, 2, None], dtype="Int64"),
    "C": ["a", "b", "c"]
})
orig_df = df[:]

arr_np = df["A"]._values
arr_ea = df["B"]._values
cat = pd.Categorical(df["C"])

# Note: there are many equivalent-looking ways of doing this setitem operation but few of them go through this code path.
df.loc[range(3), "A"] = arr_np[::-1]
df.loc[range(3), "B"] = arr_ea[::-1]
df.loc[range(3), "C"] = cat[::-1]

>>> df
     A     B  C
0  0.3  <NA>  c
1  0.2     2  b
2  0.1     1  a

>>> df.dtypes
A     float64
B       Int64
C    category
dtype: object

>>> orig_df
     A     B  C
0  0.3     1  a
1  0.2     2  b
2  0.1  <NA>  c

>>> orig_df.dtypes
A    float64
B      Int64
C     object
dtype: object
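A rough way to check the overwrite-or-not question for each of the three cases without reading the printed frames by eye. This is only a sketch: `overwritten_through_view` is a hypothetical helper, column "C" is forced to object dtype here, and the result will likely depend on the pandas version and on whether Copy-on-Write is enabled, so no expected output is shown.

```python
import numpy as np
import pandas as pd

def overwritten_through_view(column, new_values):
    """Return True if setting `column` on df also changed the sliced frame."""
    df = pd.DataFrame({
        "A": [0.1, 0.2, 0.3],
        "B": pd.array([1, 2, None], dtype="Int64"),
        "C": pd.Series(["a", "b", "c"], dtype=object),  # assumption: object dtype
    })
    orig_df = df[:]
    before = orig_df[column].copy(deep=True)   # snapshot taken before the setitem
    df.loc[range(3), column] = new_values
    return not orig_df[column].equals(before)

results = {
    "ndarray": overwritten_through_view("A", np.array([0.3, 0.2, 0.1])),
    "Int64 EA": overwritten_through_view("B", pd.array([None, 2, 1], dtype="Int64")),
    "Categorical": overwritten_through_view("C", pd.Categorical(["c", "b", "a"])),
}
print(results)
```

If the three cases were consistent, all three entries would be the same.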

The categorical behavior was implemented in #23393, and AFAICT the over-writing behavior was not discussed/intentional. Similarly, the other EA behavior was implemented in #32479, and I don't see any discussion there of whether to overwrite or not. I haven't tracked down the origin of the non-EA behavior.

I think all three cases should have the same behavior. We should also have the same behavior for should-be-equivalent setters, e.g. if we used iloc instead of loc, or [:, "A"] instead of [range(3), "A"].
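The should-be-equivalent setters can be compared the same way. Again a sketch under assumptions: the `set_*` functions and `visible_through_view` are hypothetical names, only the float column is exercised, and the outcome may vary with the pandas version and Copy-on-Write settings, so the printed result is not shown.

```python
import numpy as np
import pandas as pd

# Three spellings that look like they should behave identically.
def set_loc_range(df, v):
    df.loc[range(3), "A"] = v

def set_loc_slice(df, v):
    df.loc[:, "A"] = v

def set_iloc_slice(df, v):
    df.iloc[:, 0] = v

def visible_through_view(setter):
    """Return True if the write done via `setter` shows up in the sliced frame."""
    df = pd.DataFrame({"A": [0.1, 0.2, 0.3]})
    orig_df = df[:]
    setter(df, np.array([0.3, 0.2, 0.1]))
    return orig_df["A"].tolist() == [0.3, 0.2, 0.1]

results = {
    f.__name__: visible_through_view(f)
    for f in (set_loc_range, set_loc_slice, set_iloc_slice)
}
print(results)
```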

I think I agree with @TomAugspurger's comment that these should always be in-place, but not sure ATM if that can be done without breaking consistency elsewhere.