API: consistent sanitize_array-wrapping for lists in arithmetic ops by jbrockmendel · Pull Request #62552 · pandas-dev/pandas
Conversation
The alternative to np.array is sanitize_array (effectively pd.Series(the_list)._values). That would break the Int64 case bc the None would be cast to NaN.
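As a quick illustration of the casting difference described above (not code from the PR itself), compare how `np.array`, the Series constructor, and `pd.array` each handle a list containing `None`:

```python
import numpy as np
import pandas as pd

data = [1, None, 3]

# np.array-wrapping keeps None as a Python object (object dtype),
# which pandas ops can still treat as a missing value:
wrapped = np.array(data)
print(wrapped.dtype)  # object

# sanitize_array-like behavior (what pd.Series does) casts None to NaN,
# producing float64 -- so an Int64 operand would get upcast to float:
series_values = pd.Series(data)._values
print(series_values.dtype)  # float64

# pd.array, by contrast, infers the masked Int64 dtype and keeps NA:
masked = pd.array(data)
print(masked.dtype)  # Int64
```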
Discussed on the dev call this week to (I think) general approval that np.array-wrapping is the simplest, most consistent behavior available and that users who want something else can pretty easily do that themselves.
That said, I've found three issues where this would enshrine behavior that users complain about: #54554, #62524, #62353. In each of those cases the list contains datetimes or timedeltas. Calling np.array on it returns object dtype, where the users expected dt64 or td64.
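The dtype mismatch those issues complain about can be reproduced directly (an illustrative sketch, not code from the linked issues):

```python
import numpy as np
import pandas as pd

stamps = [pd.Timestamp("2016-01-01"), pd.Timestamp("2016-01-02")]

# np.array on a list of Timestamps returns object dtype ...
arr = np.array(stamps)
print(arr.dtype)  # object

# ... while Series-constructor inference gives the datetime64 dtype
# that the users in those issues expected:
ser = pd.Series(stamps)
print(ser.dtype)  # datetime64[ns]
```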
I still think np.array is the right move here, but want reviewers to be aware of the tradeoffs.
Added 3.0 milestone bc it is technically a breaking change.
Dangit, wrapping in np.array introduces an inconsistency in DataFrame-vs-EverythingElse behavior. If we do a DataFrame op with a list, it passes that list to the Series constructor, which has different inference behavior from np.array (e.g. on [pd.Timedelta(1)] or [1, None, 3]).
I think we need to do Series-constructor-like behavior across the board.
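To make the DataFrame-vs-EverythingElse inconsistency concrete, here is the inference gap between the two wrapping strategies on one of the examples mentioned above (an illustration, not pandas internals):

```python
import numpy as np
import pandas as pd

tds = [pd.Timedelta(1)]

# np.array-wrapping (the behavior this PR originally proposed) yields
# object dtype:
arr = np.array(tds)
print(arr.dtype)  # object

# The Series constructor (the path a DataFrame op takes for a list
# operand) infers timedelta64 instead:
ser = pd.Series(tds)
print(ser.dtype)  # timedelta64[ns]
```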
Double dang. Now I remember why I said "no" last time I tried sanitize_array. We run into NotImplemented trouble. e.g. with
left = TimedeltaArray(...)
right = [list_of_timestamps]
left + right
left+right will go through TimedeltaArray.__add__ which will call _unpack_zerodim_and_defer, which will wrap the list in DatetimeArray and call TDA.__add__(the_datetimearray), which will return NotImplemented bc the actual logic lives in DatetimeArray.__add__. But instead of deferring to DTA.__add__, that will defer to list.__add__, which of course returns NotImplemented, so we get a TypeError.
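The deferral failure above can be reproduced with plain Python classes; this is a toy sketch (the `TimedeltaLike` class is hypothetical, not pandas code) of why a list operand turns NotImplemented into a TypeError:

```python
class TimedeltaLike:
    """Toy stand-in for TimedeltaArray; not pandas code."""

    def __add__(self, other):
        # Mimic TDA.__add__ bailing out with NotImplemented because the
        # real logic lives in the other operand's class:
        return NotImplemented


left = TimedeltaLike()
right = [1, 2, 3]  # stand-in for the list of timestamps

# Python falls back to type(right).__radd__; list defines no __radd__
# (list.__add__ only concatenates lists), so a TypeError is raised
# instead of deferring to the class that holds the real logic.
try:
    left + right
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
```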
Implementation details and complexities aside, I think conceptually "the right thing" would be to convert to an array like np.array(..) but then follow our own conversion logic (as you would have in pd.Series(..), or as you would have in pd.array(), except that that does not default to using the default dtypes, ..)
I assume that is the Series-constructor-like / sanitize_array behavior you mention above? (and where you run into issues ..)
jbrockmendel changed the title from "REF: consistent ndarray-wrapping for lists in arithmetic ops" to "API: consistent sanitize_array-wrapping for lists in arithmetic ops"
Comment on lines +2221 to +2224
# as of GH#62522 the comparison op for `res==data` casts data
# using sanitize_array, which casts to 'str' dtype, which does not
# consider string 'nan' to be equal to np.nan,
# (which apparently numpy does? weird.)
I suppose if/when we support np.StringDtype, we can treat the data as masked and then we hopefully won't be comparing 'nan' to np.nan?
I think that's a design decision we can put off. My expectation would be that we'd defer to whatever design decisions numpy makes.
cmp0xff added a commit to cmp0xff/pandas-stubs that referenced this pull request
Dr-Irv pushed a commit to pandas-dev/pandas-stubs that referenced this pull request