BUG/Perf: Support ExtensionArrays in where by TomAugspurger · Pull Request #24114 · pandas-dev/pandas
Or what in setitem is not working as it should right now? (maybe I am missing some context)
@jorisvandenbossche OK, here's some context :) The most immediate failure is mismatched block dimensions / shapes in DataFrame.where for EAs (not just DatetimeArray, but that was tested). This example uses Categorical:
In [8]: df = pd.DataFrame({"A": pd.Categorical([1, 2, 3])})
In [9]: df.where(pd.DataFrame({"A": [True, False, True]}))
ValueError                                Traceback (most recent call last)
----> 1 df.where(pd.DataFrame({"A": [True, False, True]}))
...
~/sandbox/pandas/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
     84             raise ValueError(
     85                 'Wrong number of items passed {val}, placement implies '
---> 86                 '{mgr}'.format(val=len(self.values), mgr=len(self.mgr_locs)))
     87
     88     def _check_ndim(self, values, ndim):
ValueError: Wrong number of items passed 3, placement implies 1
The broadcasting is all messed up since the shapes aren't right (we're using Block.where): cond is 2-D with shape (3, 1), while the Categorical values are 1-D.
ipdb> cond
array([[ True],
       [False],
       [ True]])
ipdb> values
[1, 2, 3]
Categories (3, int64): [1, 2, 3]
ipdb> other
nan
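To illustrate why those shapes clash, here is a minimal NumPy-only sketch. It is not the pandas code path: a plain 1-D integer array stands in for the Categorical (an assumption made purely for illustration), and `np.where` stands in for whatever `Block.where` does internally.

```python
import numpy as np

# Same shapes as in the ipdb session above.
cond = np.array([[True], [False], [True]])  # 2-D, shape (3, 1)
values = np.array([1, 2, 3])                # 1-D stand-in for the Categorical
other = np.nan

# A (3, 1) mask broadcast against a (3,) array yields a (3, 3) result,
# which no longer matches a block holding a single column.
result = np.where(cond, values, other)
print(result.shape)

# Flattening the mask first keeps everything 1-D.
fixed = np.where(cond.ravel(), values, other)
print(fixed.shape)
```

The first result has three "columns" where the block placement expects one, which is the same kind of mismatch the ValueError complains about.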
A hacky but shorter fix is to use the following (this is in Block.where):
diff --git a/pandas/core/internals/blocks.py b/pandas/core/internals/blocks.py
index 618b9eb12..2356a226d 100644
--- a/pandas/core/internals/blocks.py
+++ b/pandas/core/internals/blocks.py
@@ -1319,12 +1319,20 @@ class Block(PandasObject):
 
         values = self.values
         orig_other = other
+        if not self._can_consolidate:
+            transpose = False
+
         if transpose:
             values = values.T
 
         other = getattr(other, '_values', getattr(other, 'values', other))
         cond = getattr(cond, 'values', cond)
 
+        if not self._can_consolidate:
+            if cond.ndim == 2:
+                assert cond.shape[-1] == 1
+                cond = cond.ravel()
+
         # If the default broadcasting would go in the wrong direction, then
         # explicitly reshape other instead
         if getattr(other, 'ndim', 0) >= 1:
That fixes most of the issues I was having on the DTA branch. Still running the tests to see if any were re-broken.
(edit: it's slightly more complicated, have to handle reshaping other as well, so ~15 more LOC).
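As a rough sketch of what "reshaping other as well" could look like, the snippet below flattens both cond and other when they arrive as single-column 2-D arrays. This is not the extra ~15 LOC from the branch: `ravel_single_column` and `where_1d` are hypothetical helper names, and plain NumPy arrays stand in for the extension array.

```python
import numpy as np


def ravel_single_column(arr):
    """Flatten an (n, 1) array to (n,); pass scalars and 1-D input through."""
    arr = np.asarray(arr)
    if arr.ndim == 2:
        if arr.shape[-1] != 1:
            raise ValueError(
                "expected a single column, got shape {}".format(arr.shape))
        arr = arr.ravel()
    return arr


def where_1d(values, cond, other):
    """Hypothetical helper: apply a mask to 1-D values, keeping the result 1-D."""
    cond = ravel_single_column(cond)
    other = ravel_single_column(other)
    return np.where(cond, values, other)


values = np.array([1, 2, 3])
cond = np.array([[True], [False], [True]])

# Scalar replacement value.
with_scalar = where_1d(values, cond, np.nan)

# Replacement values that also arrive as a single 2-D column.
with_column = where_1d(values, cond, np.array([[10], [20], [30]]))
```

Both calls return a 1-D result of length 3, so the output still lines up with a one-column block.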
So, in summary:
- we need to support EAs in DataFrame.where
- The diff just above is a hacky way of achieving just that (no more).
- This PR will still be useful for avoiding the conversion to object (DatetimeTZBlock will avoid it even without this PR, thanks to _try_coerce_args coercing datetimes to ints)