ENH: Add lazy copy to astype by phofl · Pull Request #50802 · pandas-dev/pandas (original) (raw)

So the reason that the decimal test you added works without actual code change is because the ExtensionDtype.kind attribute for that dtype is also object (that's the default on the base class ExtensionDtype). And then you get the elif is_object_dtype(dtype) and new_dtype.kind == "O": returning True, so we keep track of references. But if the kind is not object, we might end up in the final return False, and we might be incorrectly assuming the result is a copy and not keep track of references.

And I think that all our own extension dtypes (that are not the nullable ones) have a kind of "O" (categorical, period, interval, ..), and so those don't run into the issue.
We have one test extension dtype that doesn't have to be of kind object, and that is the FloatAttrDtype. If I update that dtype to have the correct kind of float:

diff --git a/pandas/tests/extension/array_with_attr/array.py b/pandas/tests/extension/array_with_attr/array.py index d9327ca9f2..94923b1b32 100644 --- a/pandas/tests/extension/array_with_attr/array.py +++ b/pandas/tests/extension/array_with_attr/array.py @@ -20,6 +20,7 @@ class FloatAttrDtype(ExtensionDtype): type = float name = "float_attr" na_value = np.nan

then I observe the following bug with the current branch:

In [1]: from pandas.tests.extension.array_with_attr import FloatAttrDtype

In [2]: s = pd.Series([1, 2, 3], dtype="float")

In [3]: s2 = s.astype(FloatAttrDtype())

In [4]: s[0] = 10.0   # <- this is fine, s2 not changed (since a copy was made)

In [5]: s2
Out[5]: 
0    1.0
1    2.0
2    3.0
dtype: float_attr

In [6]: pd.options.mode.copy_on_write = True

In [7]: s = pd.Series([1, 2, 3], dtype="float")

In [8]: s2 = s.astype(FloatAttrDtype())

In [9]: s[0] = 10.0   # but with CoW enabled, this does mutate s2

In [10]: s2
Out[10]: 
0    10.0
1     2.0
2     3.0
dtype: float_attr

I don't know if there are many EAs out there that would run into this, though .. But when we end up calling EA._from_sequence (calling astype(EAdtype) with a dtype we don't implement ourselves), I think we should either assume that the result can be view (and so track refs), or either in this case do pass copy=True to _from_sequence, instead of copy=False now when CoW is enabled.