BUG: ensure reindex / getitem to select columns properly copies data for extension dtypes by jorisvandenbossche · Pull Request #51197 · pandas-dev/pandas
I encountered this while writing more tests for Copy-on-Write. Currently, the general rule is that selecting columns with a list-like indexer using getitem gives a copy:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(10, 4), columns=["a", "b", "c", "d"])
    subset = df[["a", "b"]]
    # subset is a copy, so modifying it leaves df untouched
    subset.iloc[0, 0] = 0
    assert df.iloc[0, 0] != 0

However, that doesn't seem to be the case when the columns we select are extension dtypes. When using dtype="Float64" in the above example, the original df gets updated because subset isn't a copy.
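To make the comparison concrete, here is a minimal sketch that runs the same check for a numpy-backed frame and for its Float64 counterpart. The helper `selection_is_copy` is a hypothetical name used only for illustration; it is not part of pandas or of this PR. The result for Float64 depends on the installed pandas version, so the sketch prints it rather than asserting it.

```python
import warnings

import numpy as np
import pandas as pd


def selection_is_copy(df, columns):
    """Hypothetical helper: True if mutating df[columns] leaves df unchanged."""
    before = df.copy(deep=True)  # independent snapshot to compare against
    subset = df[columns]
    with warnings.catch_warnings():
        # Older pandas versions may emit SettingWithCopyWarning here.
        warnings.simplefilter("ignore")
        subset.iloc[0, 0] = 0.0
    return bool(df.equals(before))


# Values 1.0..40.0 contain no zeros, so 0.0 is a safe sentinel.
values = np.arange(1.0, 41.0).reshape(10, 4)
df_numpy = pd.DataFrame(values, columns=["a", "b", "c", "d"])
df_ext = df_numpy.astype("Float64")  # same data as extension dtype

numpy_is_copy = selection_is_copy(df_numpy, ["a", "b"])
ext_is_copy = selection_is_copy(df_ext, ["a", "b"])

print(f"numpy float64 selection is a copy: {numpy_is_copy}")
print(f"Float64 selection is a copy: {ext_is_copy}")
```

On a pandas version affected by the issue described here, the second line would be expected to report that the Float64 selection is not a copy; with the fix (or with Copy-on-Write enabled) both cases should behave the same.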
While I am not sure this is an explicitly documented rule (AFAIK this is de-facto behaviour, and only described as such in the discussions related to copy/view and CoW), I do think it would be expected that extension dtypes behave the same as numpy dtypes on this front.
(It also makes writing tests for copy/view behaviour harder if the behaviour depends not only on whether CoW is enabled, but also on whether the columns are numpy or extension dtypes. This is where I encountered the issue.)
- Tests added and passed if fixing a bug or adding a new feature
- All code checks passed.
- Added type annotations to new arguments/methods/functions.
- Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.