API: how to handle NA in conversion to numpy arrays

In #29964 and #29961 (NA in IntegerArray and BooleanArray), the question comes up of how to handle pd.NA values in conversion to numpy arrays.

Such conversion occurs mainly in __array__ (for np.(as)array(..)) and .astype(). For example:

In [3]: arr = pd.array([1, 2, pd.NA], dtype="Int64")

In [4]: np.asarray(arr)
Out[4]: array([1, 2, None/pd.NA/..?], dtype=object)

In [5]: arr.astype(float)
Out[5]: array([ 1., 2., nan]) # <--- allow automatic NA to NaN conversion?

Questions that come up here:

By default, when converting to object dtype, what "NA value" should be used? Previously this was NaN or None; now it could logically be pd.NA.
A possible reason to choose None instead of pd.NA is that third-party code that needs a numpy array will typically not be able to handle pd.NA, while None is much more common. On the other hand, there is still time for such third-party code to adapt. And it will probably be good to keep list(arr) (iteration/getitem) and np.array(arr, dtype=object) consistent.
When converting to a float dtype, are we fine with automatically converting pd.NA to np.nan? Or do we think the user should explicitly opt in to this?

We will probably want to add a to_numpy method to IntegerArray/BooleanArray to be able to make those choices explicit, eg with the following signature:

def to_numpy(self, dtype=object, na_value=...):
    ...

where you can explicitly say which value to use for the NAs in the final numpy array (and Series.to_numpy can then forward such a keyword).
That way, a user can do arr.to_numpy(dtype=object, na_value=None) to get a numpy array with None instead of pd.NA, or arr.to_numpy(dtype=float, na_value=np.nan) to get a float array with NaNs.
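To make the proposed semantics concrete, here is a rough sketch using only NumPy. It assumes the array is stored as a values array plus a boolean mask; `masked_to_numpy` is a hypothetical stand-alone helper for illustration, not an existing pandas function, and it takes `na_value` as a required argument since the default is exactly what is under discussion.

```python
import numpy as np


def masked_to_numpy(values, mask, dtype, na_value):
    """Hypothetical helper: convert (values, mask) to a plain ndarray.

    Positions where ``mask`` is True are treated as missing and are
    filled with ``na_value`` after casting to ``dtype``.
    """
    values = np.asarray(values)
    mask = np.asarray(mask, dtype=bool)
    # astype copies by default, so the input values are left untouched.
    result = values.astype(dtype)
    if mask.any():
        result[mask] = na_value
    return result


values = np.array([1, 2, 0])          # the 0 is a placeholder under the mask
mask = np.array([False, False, True])

# Object array with None for the missing value.
as_object = masked_to_numpy(values, mask, dtype=object, na_value=None)

# Float array with NaN for the missing value.
as_float = masked_to_numpy(values, mask, dtype=float, na_value=np.nan)
```

With an explicit `na_value`, the caller decides what the missing entries become, which is the point of the proposed keyword.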

But even if we have that function (which I think we should), the above questions about the defaults are still to be answered (eg for __array__ we cannot have such a na_value keyword, so we need to make a default choice).
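A small illustration of why `__array__` forces a default: NumPy calls the hook with at most a `dtype` argument (and a `copy` argument in recent NumPy versions), so there is no way for the caller to pass an NA value. The class `MaskedInts` and its `_default_na` attribute below are hypothetical stand-ins, and `None` is used only as an example default, not as the proposed answer.

```python
import numpy as np


class MaskedInts:
    """Hypothetical stand-in for a masked integer array."""

    # Assumed default for illustration; the issue is about what this should be.
    _default_na = None

    def __init__(self, values, mask):
        self._data = np.asarray(values)
        self._mask = np.asarray(mask, dtype=bool)

    def __array__(self, dtype=None, copy=None):
        # No na_value can be passed in here, so a default must be baked in.
        result = self._data.astype(object)
        result[self._mask] = self._default_na
        if dtype is not None:
            result = result.astype(dtype)
        return result


arr = MaskedInts([1, 2, 0], [False, False, True])
converted = np.asarray(arr)  # goes through __array__ with the baked-in default
```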

cc @TomAugspurger @Dr-Irv