ENH: Add lazy copy to astype by phofl · Pull Request #50802 · pandas-dev/pandas (original) (raw)
So the reason that the decimal test you added works without actual code change is because the ExtensionDtype.kind attribute for that dtype is also object (that's the default on the base class ExtensionDtype). And then you get the elif is_object_dtype(dtype) and new_dtype.kind == "O":
returning True, so we keep track of references. But if the kind
is not object, we might end up in the final return False
, and we might be incorrectly assuming the result is a copy and not keep track of references.
And I think that all our own extension dtypes (that are not the nullable ones) have a kind
of "O" (categorical, period, interval, ..), and so those don't run into the issue.
We have one test extension dtype that doesn't have to be of kind
object, and that is the FloatAttrDtype
. If I update that dtype to have the correct kind
of float:
diff --git a/pandas/tests/extension/array_with_attr/array.py b/pandas/tests/extension/array_with_attr/array.py index d9327ca9f2..94923b1b32 100644 --- a/pandas/tests/extension/array_with_attr/array.py +++ b/pandas/tests/extension/array_with_attr/array.py @@ -20,6 +20,7 @@ class FloatAttrDtype(ExtensionDtype): type = float name = "float_attr" na_value = np.nan
kind = "f"
@classmethod def construct_array_type(cls) -> type_t[FloatAttrArray]:
then I observe the following bug with the current branch:
In [1]: from pandas.tests.extension.array_with_attr import FloatAttrDtype
In [2]: s = pd.Series([1, 2, 3], dtype="float")
In [3]: s2 = s.astype(FloatAttrDtype())
In [4]: s[0] = 10.0 # <- this is fine, s2 not changed (since a copy was made)
In [5]: s2
Out[5]:
0 1.0
1 2.0
2 3.0
dtype: float_attr
In [6]: pd.options.mode.copy_on_write = True
In [7]: s = pd.Series([1, 2, 3], dtype="float")
In [8]: s2 = s.astype(FloatAttrDtype())
In [9]: s[0] = 10.0 # but with CoW enabled, this does mutate s2
In [10]: s2
Out[10]:
0 10.0
1 2.0
2 3.0
dtype: float_attr
I don't know if there are many EAs out there that would run into this, though .. But when we end up calling EA._from_sequence
(calling astype(EAdtype) with a dtype we don't implement ourselves), I think we should either assume that the result can be view (and so track refs), or either in this case do pass copy=True
to _from_sequence
, instead of copy=False
now when CoW is enabled.