REGR: setting column with setitem should not modify existing array inplace · Issue #33457 · pandas-dev/pandas
So consider this example of a small dataframe with a nullable integer column:
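import numpy as np
import pandas as pd

def recreate_df():
    # Fresh frame with plain int/float columns and one nullable-integer (Int64) column
    return pd.DataFrame({'int': [1, 2, 3], 'int2': [3, 4, 5],
                         'float': [.1, .2, .3],
                         'EA': pd.array([1, 2, None], dtype="Int64")
                         })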
Assigning a new column with __setitem__ (df[col] = ...) normally does not even preserve the dtype:
In [2]: df = recreate_df()
...: df['EA'] = np.array([1., 2., 3.])
...: df['EA'].dtype
Out[2]: dtype('float64')
In [3]: df = recreate_df()
...: df['EA'] = np.array([1, 2, 3])
...: df['EA'].dtype
Out[3]: dtype('int64')
When assigning a new nullable integer array, it of course keeps the dtype of the assigned values:
In [4]: df = recreate_df()
...: df['EA'] = pd.array([1, 2, 3], dtype="Int64")
...: df['EA'].dtype
Out[4]: Int64Dtype()
However, in this case the assignment also has a tricky side effect: it happens in place, so the new values are written into the existing array instead of replacing it:
In [5]: df = recreate_df()
...: original_arr = df.EA.array
...: df['EA'] = pd.array([1, 2, 3], dtype="Int64")
...: original_arr is df.EA.array
Out[5]: True
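The identity check above can be paired with a value comparison to make the consequence visible. The following is only a minimal sketch: the helper name check_setitem and the replacement values are made up for illustration, and the result depends on the pandas version, so no particular output is claimed.

```python
import pandas as pd


def check_setitem(new_values):
    """Report how df['EA'] = new_values treats the column's existing array.

    Hypothetical helper, not part of pandas.
    """
    df = pd.DataFrame({"EA": pd.array([1, 2, None], dtype="Int64")})
    original_arr = df["EA"].array
    snapshot = original_arr.copy()  # independent copy of the starting values

    df["EA"] = new_values

    # True if the frame still holds the very same array object
    same_object = original_arr is df["EA"].array
    # True if the array we grabbed before the assignment now holds other values
    values_changed = not snapshot.equals(original_arr)
    return same_object, values_changed


same_object, values_changed = check_setitem(pd.array([10, 20, 30], dtype="Int64"))
print(f"same array object: {same_object}, original values changed: {values_changed}")
```

If setitem replaces the array, as proposed here, both flags should come out False. With the in-place behaviour reported above, both would be True.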
I don't think this behaviour should depend on the values being set; setitem should always replace the array of the ExtensionBlock.
Otherwise, as shown above, you can unexpectedly alter the data with which you created the dataframe. See also a different example of this, using Categorical, at the original PR that introduced this: #32831 (comment)
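Until setitem is guaranteed to replace the array, user code can guard against this by handing the frame a private copy of its data. This is just a defensive sketch, not a proposed fix for the issue, and the helper name frame_from_private_copy is invented for the example.

```python
import pandas as pd


def frame_from_private_copy(values):
    """Build a one-column frame that owns a copy of `values`, not `values` itself.

    Hypothetical helper, not part of pandas.
    """
    return pd.DataFrame({"EA": values.copy()})


source = pd.array([1, 2, None], dtype="Int64")
df = frame_from_private_copy(source)
df["EA"] = pd.array([10, 20, 30], dtype="Int64")

# The caller's array is untouched whichever way setitem behaves,
# because the frame never held a reference to it.
source_unchanged = source.equals(pd.array([1, 2, None], dtype="Int64"))
print(source_unchanged)
```

The extra copy costs memory, which is one reason to prefer fixing setitem itself over relying on callers to copy.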