API/BUG: always try to operate inplace when setting with loc/iloc[foo, bar] by jbrockmendel · Pull Request #39163 · pandas-dev/pandas
- closes API: setitem copy/view behavior ndarray vs Categorical vs other EA #38896
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
(The PR title should have a caveat: "in cases that make it to BlockManager.setitem".)
Discussed briefly on today's call. The main idea, as discussed in #38896 and #33457, is that df[col] = ser should not alter the existing df[col]._values, while df.iloc[:, num] = ser should always try to operate inplace before falling back to casting. This PR focuses on the latter case.
ATM, restricting to cases where we a) get to Block.setitem, b) could do the setitem inplace, and c) are setting all the entries in the array, we have 4 different behaviors. Examples are posted at the bottom of this post.
This PR changes Block.setitem so that in the _can_hold_element case we always operate inplace and always retain the same underlying array.
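The "try inplace, then fall back to casting" idea can be sketched in plain NumPy. This is only an illustration, not the pandas implementation: the function name setitem_inplace_or_cast is hypothetical, and np.can_cast with "safe" casting is used as an assumed stand-in for the internal _can_hold_element check.

```python
import numpy as np


def setitem_inplace_or_cast(values, indexer, new):
    """Hypothetical helper: write into `values` inplace when its dtype can
    hold `new`; otherwise fall back to a cast copy with a common dtype."""
    new = np.asarray(new)
    # Assumption: "safe" castability stands in for pandas' _can_hold_element.
    if np.can_cast(new.dtype, values.dtype, casting="safe"):
        values[indexer] = new  # inplace: the same underlying array is kept
        return values
    # Fallback: cast to a common dtype, which allocates a new array.
    common = np.result_type(values.dtype, new.dtype)
    out = values.astype(common)
    out[indexer] = new
    return out


# int16 values fit into a float64 array, so the write happens inplace.
floats = np.zeros(4, dtype=np.float64)
res_inplace = setitem_inplace_or_cast(
    floats, slice(None), np.arange(4, dtype=np.int16)
)

# float64 values do not safely fit into int64, so we fall back to casting.
ints = np.zeros(4, dtype=np.int64)
res_cast = setitem_inplace_or_cast(
    ints, slice(None), np.array([0.5, 1.5, 2.5, 3.5])
)
```

In the first call the returned array is the original buffer; in the second the original integer array is left untouched and a new float64 array is returned.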
Existing Behavior
- If the new_values being set are categorical, we overwrite the existing values and then discard them, and do not get a view on new_values. (This can only be replicated with Series until BUG: setting categorical values into object dtype DataFrame #39136.)
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(5), dtype=object)
cat = pd.Categorical(ser)[::-1]
vals = ser.values
ser.iloc[ser.index] = cat
assert (vals == cat).all() # <-- vals are overwritten
assert ser.dtype == cat.dtype # <-- vals are discarded
assert not np.shares_memory(ser._values.codes, cat.codes) # <-- we also don't get a view on the new values
- If the new_values are any other EA, we do not overwrite existing values and do get a view on the new_values.
df = pd.DataFrame({"A": np.arange(5, dtype=np.int32)})
arr = pd.array(df["A"].values) + 1
vals = df.values
df.iloc[df.index, 0] = arr
assert (df.dtypes == arr.dtype).all() # <-- cast
assert not (vals == df).any(axis=None) # <-- did not overwrite original
- If the new_values are a new non-EA dtype, we overwrite the old values and create a new array, getting a view on neither.
import pandas._testing as tm  # pandas-internal test helpers

df = tm.makeDataFrame() # <-- float64
old = df.values
new = np.arange(df.size).astype(np.int16).reshape(df.shape)
df.iloc[:, [0, 1, 2, 3]] = new
assert (old == new).all()
assert not np.shares_memory(df.values, old)
assert not np.shares_memory(df.values, new)
- If the new_values have the same dtype as the existing, we overwrite existing and keep the same array
df = tm.makeDataFrame() # <-- float64
old = df.values
new = np.arange(df.size).astype(np.float64).reshape(df.shape)
df.iloc[:, [0, 1, 2, 3]] = new
assert (old == new).all()
assert np.shares_memory(df.values, old)
assert not np.shares_memory(df.values, new)
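The examples above each check some subset of the same three properties: whether the old buffer was overwritten, whether the result still uses the old array, and whether the result is a view on the new values. A small NumPy-only sketch of such a check follows. The helper name setitem_outcome is hypothetical and is not part of pandas; it only packages the np.shares_memory and equality checks used in the examples.

```python
import numpy as np


def setitem_outcome(old, new, result):
    """Hypothetical helper summarizing what a set operation did."""
    return {
        # Did the original buffer end up holding the new values?
        "overwrote_old": bool((old == new).all()),
        # Is the result still backed by the original array?
        "kept_old_array": bool(np.shares_memory(result, old)),
        # Is the result a view on the values that were passed in?
        "view_on_new": bool(np.shares_memory(result, new)),
    }


# Inplace write: old buffer overwritten, same array kept, no view on new.
old = np.zeros(4)
new = np.arange(4, dtype=np.float64)
result = old
result[:] = new
inplace_outcome = setitem_outcome(old, new, result)

# Rebinding: old buffer untouched, result is the new array itself.
old2 = np.zeros(4)
new2 = np.arange(4, dtype=np.float64)
rebind_outcome = setitem_outcome(old2, new2, new2)
```

Under the behavior this PR aims for, cases that can be done inplace would look like the first outcome: overwritten, same array, no view on the new values.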