BUG: filter (with dropna=False) when there are no groups fulfilling the condition · Issue #12768 · pandas-dev/pandas

For a DataFrame, I want to preserve rows that belong to groups that fulfil a specific condition and replace the other rows with NaN. I used a combination of 'groupby' and 'filter' (with dropna=False). In the special case where no groups fulfil the condition, an exception occurred.

AttributeError                            Traceback (most recent call last)
 in ()
----> 1 pd.DataFrame({'a': [1,1,2], 'b':[1,2,0]}).groupby('a').filter(lambda x: x['b'].sum() > 5, dropna=False)

....../local/lib/python2.7/site-packages/pandas/core/groupby.py in filter(self, func, dropna, *args, **kwargs)
   3570                                 type(res).__name__)
   3571
-> 3572         return self._apply_filter(indices, dropna)
   3573
   3574

....../local/lib/python2.7/site-packages/pandas/core/groupby.py in _apply_filter(self, indices, dropna)
    831             mask = np.empty(len(self._selected_obj.index), dtype=bool)
    832             mask.fill(False)
--> 833             mask[indices.astype(int)] = True
    834             # mask fails to broadcast when passed to where; broadcast manually.
    835             mask = np.tile(mask, list(self._selected_obj.shape[1:]) + [1]).T

AttributeError: 'list' object has no attribute 'astype'

The problem is in the _apply_filter method of the _GroupBy class (core/groupby.py): the line "mask[indices.astype(int)] = True" raises because, in my case, indices is an empty list ([]), and a list has no astype method. Shouldn't it be "indices = np.array([])" instead of "indices = []" when len(indices) == 0?

def _apply_filter(self, indices, dropna):
    if len(indices) == 0:
        indices = []
    else:
        indices = np.sort(np.concatenate(indices))
    if dropna:
        filtered = self._selected_obj.take(indices, axis=self.axis)
    else:
        mask = np.empty(len(self._selected_obj.index), dtype=bool)
        mask.fill(False)
        mask[indices.astype(int)] = True
        # mask fails to broadcast when passed to where; broadcast manually.
        mask = np.tile(mask, list(self._selected_obj.shape[1:]) + [1]).T
        filtered = self._selected_obj.where(mask)  # Fill with NaNs.
    return filtered
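
The suggested change can be tried in isolation. The following is only a minimal sketch of the masking step, not the pandas implementation: build_keep_mask is a hypothetical helper name, and it assumes indices is a list of integer position arrays, one per group that passed the filter.

```python
import numpy as np


def build_keep_mask(indices, n_rows):
    """Hypothetical standalone version of the masking step in _apply_filter."""
    if len(indices) == 0:
        # Proposed change: use an empty integer ndarray instead of a plain
        # list, so that .astype() below is available.
        indices = np.array([], dtype="int64")
    else:
        indices = np.sort(np.concatenate(indices))
    mask = np.zeros(n_rows, dtype=bool)
    # Indexing with an empty integer array is a no-op, so the mask stays all False.
    mask[indices.astype(int)] = True
    return mask


# No group passed the filter: every row is masked out.
print(build_keep_mask([], 3))

# Two groups passed, covering row positions 2 and 0, 1 of a 4-row frame.
print(build_keep_mask([np.array([2]), np.array([0, 1])], 4))
```

With an empty ndarray the empty case goes through the same code path as the non-empty one, so no extra branch is needed around the mask assignment.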

Code Sample, a copy-pastable example if possible

import pandas as pd
pd.DataFrame({'a': [1,1,2], 'b': [1,2,0]}).groupby('a').filter(lambda x: x['b'].sum() > 5, dropna=False)

Expected Output

    a   b
0 NaN NaN
1 NaN NaN
2 NaN NaN
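
Until the bug is fixed, the same result can probably be obtained without calling filter at all. This is a workaround sketch, not part of the original report: it assumes the condition can be written as a per-group aggregate (here the sum of 'b'), broadcasts that aggregate back to the rows with transform, and blanks the rows whose group fails the condition.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 0]})

# Per-row boolean: True where the row's group satisfies the condition.
keep = df.groupby('a')['b'].transform('sum') > 5

# Work on a float copy so NaN can be stored in every column.
result = df.astype(float)
result.loc[~keep] = np.nan

print(result)
```

Here no group has a sum of 'b' greater than 5, so every row is replaced with NaN, matching the expected output above. The columns come out as float64 because NaN cannot be stored in an integer column.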

output of pd.show_versions()

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-56-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 1.5.6
setuptools: 12.2
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: 4.0.3
sphinx: None
patsy: 0.4.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.7.6
lxml: None
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.9
apiclient: None
sqlalchemy: None
pymysql: 0.6.6.None
psycopg2: None
jinja2: 2.8
boto: None