Use new NA scalar in BooleanArray by jorisvandenbossche · Pull Request #29961 · pandas-dev/pandas (original) (raw)

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

Follow-up on #29597 and #29555, now actually using the pd.NA scalar in BooleanArray.

This was referenced

Dec 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM at a glance. Request for a few more tests / confirmation that these are already testsed:

Test to enure that array([True, False, None, np.nan pd.NA], dtype="boolean") correctly sanitizes all the NA-like value to be NA.
Test in BooleanArray.__setitem__ ensuring that arr[0] = np.nan, etc., always inserts NA.

One question that comes up here (but the same question applies to string and integer arrays): what "NA value" do we want to use when converting to object dtype numpy arrays? None or pd.NA ?

In the initial BooleanArray PR, I used None (since pd.NA was not there yet), so you get a numpy array like np.array([True, False, None]).
Now we can start using pd.NA, which is closer to the pandas representation (and what you would get from iteration or conversion to list np.array([i for i in arr])). But on the other hand, None can be easier to handle for cases where you need a numpy array (functionality that needs numpy arrays will typically also not recognize or handle correctly pd.NA).

I'm not sure, but my initial preference is for having pd.NA, with an option to get other values for NA upon request. I think that means we should have a somewhat standard to_numpy method

def to_numpy(self, dtype=object, na_value=pd.NA): ...

Then if the user wants None / NaN, they can request it relatively easily.

do we want to use when converting to object dtype numpy arrays?

What are the user-actions that hit this?

np.array(boolarray, dtype=...)
boolarray.astype(np_dtype)
...?

Examples where the pd.NA in a numpy array gives problems: #29976 (pyarrow's C conversion code does not know it) and the factorize errors in #29964 (our cython hashing code for object arrays cannot handle it).

Now, both are easy to solve (certainly since both can easily be solved on our side; the hashing code can recognize pd.NA and for the conversion to pyarrow we can ensure to use None before passing to pyarrow). But it are examples of how other code can break.

Still not sure that we should therefore use None as the default in __array__, but at least I think it is important to have this to_numpy(.., na_value=..).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the "what value to use when converting", I'm not sure that downstream projects not understanding pd.NA should drive our decision here. We'll be living with this decision for a while, and projects will have time to adapt.


def _coerce_to_ndarray(self, force_bool: bool = False):
def _coerce_to_ndarray(
self, force_bool: bool = False, na_value: "Scalar" = lib._no_default

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to prefer lib_no_default to just libmissing.NA directly?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, not sure. I was probably sidetracked by the idea I could not use None as default as that is a valid value as well ..
(if we want to have this generic / shared with other arrays, using self._na_value might be useful, but I don't think we will share this with arrays that don't use pd.NA, so ..)

This was referenced

Dec 4, 2019

See #30043 for the CI failures. OK to ignore for now.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can move forward with this, despite the ongoing API discussion about .to_numpy / .astype(float) / np.asarray(arr), since that will be affecting all IntegerArray / StringArray / BooleanArray.