BUG: filter (with dropna=False) when there are no groups fulfilling the condition · Issue #12768 · pandas-dev/pandas
For a DataFrame, I want to preserve the rows that belong to groups fulfilling a specific condition and replace all other rows with NaN. I used a combination of 'groupby' and 'filter' (with dropna=False). In the special case where no group fulfils the condition, an exception is raised:
AttributeError                            Traceback (most recent call last)
 in ()
----> 1 pd.DataFrame({'a': [1,1,2], 'b':[1,2,0]}).groupby('a').filter(lambda x: x['b'].sum() > 5, dropna=False)

....../local/lib/python2.7/site-packages/pandas/core/groupby.py in filter(self, func, dropna, *args, **kwargs)
   3570                                 type(res).__name__)
   3571
-> 3572         return self._apply_filter(indices, dropna)
   3573
   3574

....../local/lib/python2.7/site-packages/pandas/core/groupby.py in _apply_filter(self, indices, dropna)
    831             mask = np.empty(len(self._selected_obj.index), dtype=bool)
    832             mask.fill(False)
--> 833             mask[indices.astype(int)] = True
    834             # mask fails to broadcast when passed to where; broadcast manually.
    835             mask = np.tile(mask, list(self._selected_obj.shape[1:]) + [1]).T

AttributeError: 'list' object has no attribute 'astype'
The problem is in the _apply_filter method of the _GroupBy class (core/groupby.py): the line "mask[indices.astype(int)] = True" raises because in my case indices is an empty list ([]), which has no astype method. Shouldn't it be "indices = np.array([])" instead of "indices = []" when len(indices) == 0?
    def _apply_filter(self, indices, dropna):
        if len(indices) == 0:
            indices = []
        else:
            indices = np.sort(np.concatenate(indices))
        if dropna:
            filtered = self._selected_obj.take(indices, axis=self.axis)
        else:
            mask = np.empty(len(self._selected_obj.index), dtype=bool)
            mask.fill(False)
            mask[indices.astype(int)] = True
            # mask fails to broadcast when passed to where; broadcast manually.
            mask = np.tile(mask, list(self._selected_obj.shape[1:]) + [1]).T
            filtered = self._selected_obj.where(mask)  # Fill with NaNs.
        return filtered
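The suggested change can be tried outside pandas with a rough standalone sketch. The function name apply_filter_sketch is hypothetical and is not part of pandas; it handles only 2-D DataFrames along axis 0, and it uses an empty int64 array (rather than the plain np.array([]) proposed above) as an assumption so that the dropna=True branch with take also receives integer indices. It is not the actual patch.

```python
import numpy as np
import pandas as pd


def apply_filter_sketch(obj, indices, dropna):
    """Standalone approximation of _apply_filter (hypothetical helper, not pandas API)."""
    if len(indices) == 0:
        # Suggested change: an empty integer ndarray instead of a plain list,
        # so the .astype(int) call below is available.
        indices = np.array([], dtype="int64")
    else:
        indices = np.sort(np.concatenate(indices))
    if dropna:
        return obj.take(indices, axis=0)
    mask = np.zeros(len(obj.index), dtype=bool)
    mask[indices.astype(int)] = True
    # Expand the row mask to the frame's full 2-D shape before calling where().
    mask = np.tile(mask[:, None], (1, obj.shape[1]))
    return obj.where(mask)  # rows outside the mask become NaN


df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 0]})
# No group passed the filter: every row should be NaN-filled.
no_groups = apply_filter_sketch(df, [], dropna=False)
# Rows 0 and 1 passed the filter: only row 2 should be NaN-filled.
one_group = apply_filter_sketch(df, [np.array([0, 1])], dropna=False)
```

With the empty-array branch in place, the no-groups case goes through the same masking path as the ordinary case instead of failing on a list.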
Code Sample, a copy-pastable example if possible
import pandas as pd
pd.DataFrame({'a': [1,1,2], 'b': [1,2,0]}).groupby('a').filter(lambda x: x['b'].sum() > 5, dropna=False)
Expected Output
     a   b
0  NaN NaN
1  NaN NaN
2  NaN NaN
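For comparison, the same call with a threshold that one group does satisfy should run without error, since indices is then a NumPy array rather than an empty list. The threshold of 2 is chosen purely for illustration and is not from the original report.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 0]})
# Group a=1 has b.sum() == 3 and passes; group a=2 has b.sum() == 0 and fails.
partial = df.groupby('a').filter(lambda x: x['b'].sum() > 2, dropna=False)
print(partial)
```

Rows of the passing group keep their values and the row of the failing group is filled with NaN, which is the behaviour the all-NaN expected output extends to the case where no group passes.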
output of pd.show_versions()
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-56-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 1.5.6
setuptools: 12.2
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: 4.0.3
sphinx: None
patsy: 0.4.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.7.6
lxml: None
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.9
apiclient: None
sqlalchemy: None
pymysql: 0.6.6.None
psycopg2: None
jinja2: 2.8
boto: None