BUG/Perf: Support ExtensionArrays in where by TomAugspurger · Pull Request #24114 · pandas-dev/pandas

Or what in setitem is not working as it should right now? (maybe I am missing some context)

@jorisvandenbossche OK, here's some context :) The most immediate failure is mismatched block dimensions / shapes in DataFrame.where for EAs (not just DatetimeArray, though that is what was tested). This example uses Categorical:

In [8]: df = pd.DataFrame({"A": pd.Categorical([1, 2, 3])})

In [9]: df.where(pd.DataFrame({"A": [True, False, True]}))


ValueError                                Traceback (most recent call last)
----> 1 df.where(pd.DataFrame({"A": [True, False, True]}))

...

~/sandbox/pandas/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
     84             raise ValueError(
     85                 'Wrong number of items passed {val}, placement implies '
---> 86                 '{mgr}'.format(val=len(self.values), mgr=len(self.mgr_locs)))
     87
     88     def _check_ndim(self, values, ndim):

ValueError: Wrong number of items passed 3, placement implies 1

The broadcasting is all messed up since the shapes aren't right: cond is 2-D while values is 1-D (we're using Block.where).

ipdb> cond
array([[ True],
       [False],
       [ True]])
ipdb> values
[1, 2, 3]
Categories (3, int64): [1, 2, 3]
ipdb> other
nan
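To illustrate why those shapes are a problem, here is a minimal NumPy-only sketch. It is not the code path in Block.where; the plain integer array is an assumed stand-in for the 1-D Categorical, and the `ravel()` call is just one illustrative way to line the shapes up.

```python
import numpy as np

# Same shapes as in the ipdb session above.
cond = np.array([[True], [False], [True]])  # 2-D mask, shape (3, 1)
values = np.array([1, 2, 3])                # 1-D stand-in for the Categorical, shape (3,)

# (3, 1) against (3,) broadcasts to (3, 3) instead of staying length 3.
result = np.where(cond, values, np.nan)
print(result.shape)

# Flattening the mask to 1-D gives the elementwise result we actually want.
fixed = np.where(cond.ravel(), values, np.nan)
print(fixed)
```

The broadcast result has the wrong shape for a single-column block, which is consistent with the "Wrong number of items passed 3, placement implies 1" error above.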

A hacky but shorter fix is the following change in Block.where:

diff --git a/pandas/core/internals/blocks.py b/pandas/core/internals/blocks.py
index 618b9eb12..2356a226d 100644
--- a/pandas/core/internals/blocks.py
+++ b/pandas/core/internals/blocks.py
@@ -1319,12 +1319,20 @@ class Block(PandasObject):

     values = self.values
     orig_other = other

That fixes most of the issues I was having on the DTA branch. Still running the tests to see if any were re-broken.
(edit: it's slightly more complicated, have to handle reshaping other as well, so ~15 more LOC).


So, in summary

  1. We need to support EAs in DataFrame.where.
  2. The diff just above is a hacky way of achieving just that (no more).
  3. This PR will still be useful for avoiding the conversion to object (DatetimeTZBlock will avoid it even without this PR, thanks to _try_coerce_args coercing datetimes to ints)
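As a rough sketch of what "avoiding the conversion to object" could look like for Categorical, the masking can be done on the integer codes, where -1 marks a missing value. `where_categorical` is a hypothetical helper written for illustration only; it is not the implementation in this PR and handles only the `other=nan` case.

```python
import numpy as np
import pandas as pd


def where_categorical(values, cond):
    """Hypothetical helper: keep entries where cond is True, else missing."""
    # Flatten an (n, 1) mask to (n,) so it lines up with the 1-D array.
    cond = np.asarray(cond, dtype=bool).ravel()
    # Work on the integer codes; -1 is the code pandas uses for missing.
    codes = np.where(cond, values.codes, -1)
    return pd.Categorical.from_codes(
        codes, categories=values.categories, ordered=values.ordered
    )


cat = pd.Categorical([1, 2, 3])
out = where_categorical(cat, [[True], [False], [True]])
print(out)
```

The result stays categorical with the original categories, and the masked position becomes missing rather than forcing the whole column to object dtype.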