Boolean operations on groupby objects are extremely slow compared to their numpy counterparts.

%timeit -n 10000 np.random.choice([True, False], 10000).any()
%timeit -n 10000 np.random.choice([True, False], 10000).sum().astype(bool)

10000 loops, best of 3: 83.7 µs per loop
10000 loops, best of 3: 106 µs per loop

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.random.randint(0, 20, 10000),
    "b": np.random.randint(0, 20, 10000),
    "c": np.random.choice([True, False], 10000),
})

%timeit -n 100 df.groupby(["a", "b"])["c"].any()
%timeit -n 100 df.groupby(["a", "b"])["c"].sum().astype(bool)

100 loops, best of 3: 40.9 ms per loop
100 loops, best of 3: 1.46 ms per loop

Problem description

The issue is that the any method for groupby objects seems to be extremely slow: about 40.9 ms per loop versus 1.46 ms for the sum-based version in the timings above, roughly 28 times slower. It is actually faster to sum the boolean values and cast the result with .astype(bool). In numpy the two operations have similar timings, and the version with any is actually the faster one.
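To make the workaround concrete, here is a minimal sketch that checks the two approaches return the same result on data shaped like the example above. The fixed seed and the variable names (rng, grouped, via_any, via_sum) are my own additions for illustration and are not part of the original report; no timings are claimed here, since they will vary by pandas version and machine.

```python
import numpy as np
import pandas as pd

# Fixed seed so the check is reproducible (an assumption, not in the report).
rng = np.random.RandomState(0)

df = pd.DataFrame({
    "a": rng.randint(0, 20, 10000),
    "b": rng.randint(0, 20, 10000),
    "c": rng.choice([True, False], 10000),
})

grouped = df.groupby(["a", "b"])["c"]

# Direct approach: the slow path described in this report.
via_any = grouped.any()

# Workaround: count the True values per group, then cast the count to bool.
# A group has at least one True exactly when its count is non-zero.
via_sum = grouped.sum().astype(bool)

# Both approaches should agree for every group.
assert (via_any == via_sum).all()
```

The cast works because the sum of booleans is the number of True values, so any non-zero count maps to True and a zero count maps to False.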

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 8.1.1
setuptools: None
Cython: None
numpy: 1.12.0
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.2.2
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: 2.45.0
pandas_datareader: None