API: how to handle NA in conversion to numpy arrays

In #29964 and #29961 (NA in IntegerArray and BooleanArray), the question comes up of how to handle pd.NA values in conversion to numpy arrays.

Such conversion occurs mainly in __array__ (for np.(as)array(..)) and .astype(). For example:

In [3]: arr = pd.array([1, 2, pd.NA], dtype="Int64")

In [4]: np.asarray(arr)
Out[4]: array([1, 2, None/pd.NA/..?], dtype=object)

In [5]: arr.astype(float)
Out[5]: array([ 1., 2., nan]) # <--- allow automatic NA to NaN conversion?

Questions that come up here:

We will probably want to add a to_numpy method to IntegerArray and BooleanArray so that those choices can be made explicit, e.g. with the following signature:

def to_numpy(self, dtype=object, na_value=...):
    ... 

where you can explicitly say which value to use for the NAs in the final numpy array (and Series.to_numpy can then forward such a keyword).
That way, a user can do arr.to_numpy(dtype=object, na_value=None) to get a numpy array with None instead of pd.NA, or arr.to_numpy(dtype=float, na_value=np.nan) to get a float array with NaNs.
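
To make the proposed semantics concrete, here is a minimal sketch of what such a conversion could do, written against plain numpy. The function name masked_to_numpy and the explicit values/mask pair are hypothetical stand-ins for the array's internals, not an existing pandas API:

```python
import numpy as np


def masked_to_numpy(values, mask, dtype=object, na_value=None):
    """Hypothetical stand-in for the proposed to_numpy(dtype, na_value).

    `values` holds the data, `mask` is True where the element is NA.
    """
    # astype returns a copy by default, so the input is left untouched
    result = np.asarray(values).astype(dtype)
    # Overwrite the masked positions with the caller's chosen NA marker
    result[np.asarray(mask, dtype=bool)] = na_value
    return result


# Stand-in for pd.array([1, 2, pd.NA], dtype="Int64")
values = np.array([1, 2, 0], dtype="int64")
mask = np.array([False, False, True])

as_object = masked_to_numpy(values, mask, dtype=object, na_value=None)
as_float = masked_to_numpy(values, mask, dtype=float, na_value=np.nan)
```

Note that in this sketch the caller has to pick a na_value that the target dtype can hold: asking for an integer dtype together with na_value=np.nan fails inside numpy, which is one of the cases the defaults have to account for.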

But even if we have that function (which I think we should), the above questions about the defaults still need to be answered (e.g. for __array__ we cannot have such a na_value keyword, so we need to make a default choice).
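
The constraint on __array__ can be illustrated with a toy container. This is only a sketch: the class MaskedInts is hypothetical, and the choice of None as the NA marker is made purely for illustration, not as a proposed default:

```python
import numpy as np


class MaskedInts:
    """Hypothetical toy container: int64 data plus a boolean NA mask."""

    def __init__(self, values, mask):
        self._values = np.asarray(values, dtype="int64")
        self._mask = np.asarray(mask, dtype=bool)

    def __array__(self, dtype=None, copy=None):
        # numpy only ever passes dtype here (newer versions may also pass
        # copy); there is no way for the caller to hand over a na_value,
        # so the marker for missing values has to be hard-coded.
        out = self._values.astype(object if dtype is None else dtype)
        out[self._mask] = None  # illustrative default, not a recommendation
        return out


arr = MaskedInts([1, 2, 0], [False, False, True])
converted = np.asarray(arr)  # goes through __array__ with no NA argument
```

Whatever value is assigned inside __array__ is what every np.asarray(...) call sees, which is why the default needs to be decided independently of any to_numpy keyword.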

cc @TomAugspurger @Dr-Irv