PERF/BENCHMARK: comprehensive cat-groupby benchmarks · Issue #19026 · pandas-dev/pandas

We have a number of groupby benchmarks with categoricals, but I think we need a comprehensive set that exercises combinations of:

groupby on categorical vs. object columns
Cython-backed functions (e.g. first, max, ...)
.agg variants of those Cython functions
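
One way such a combined set could be laid out is a parametrized benchmark class in the style that asv discovers (`params`, `param_names`, `setup`, `time_*` methods). This is only a sketch: the class name, method names, row count, and the choice of `first`/`last` are assumptions made for illustration, and `observed=True` is passed only to keep newer pandas versions from warning on categorical keys.

```python
import numpy as np
import pandas as pd


class GroupByCategoricalCombos:
    """Hypothetical asv-style benchmark; all names here are illustrative."""

    N = 100_000
    params = [
        ["object", "category", "codes"],  # dtype of the grouping key
        ["object", "category"],           # dtype of the aggregated column
        ["first", "last"],                # reductions to benchmark
    ]
    param_names = ["key_dtype", "value_dtype", "func"]

    def setup(self, key_dtype, value_dtype, func):
        rng = np.random.default_rng(0)
        animals = np.array(["Dog", "Cat"])
        days = np.array(["Sunday", "Monday", "Tuesday", "Wednesday",
                         "Thursday", "Friday", "Saturday"])
        # Start from plain object columns, then convert as requested.
        df = pd.DataFrame({
            "days": days.take(rng.integers(0, len(days), size=self.N)),
            "animals": animals.take(rng.integers(0, len(animals), size=self.N)),
        }).astype(object)
        if key_dtype in ("category", "codes"):
            df["days"] = df["days"].astype("category")
        if key_dtype == "codes":
            df["days"] = df["days"].cat.codes
        if value_dtype == "category":
            df["animals"] = df["animals"].astype("category")
        self.df = df

    def time_cython_method(self, key_dtype, value_dtype, func):
        # Direct call, e.g. df.groupby("days").first()
        getattr(self.df.groupby("days", observed=True), func)()

    def time_agg_variant(self, key_dtype, value_dtype, func):
        # Same reduction routed through .agg with a dict spec
        self.df.groupby("days", observed=True).agg({"animals": func})
```

Each of the 3 x 2 x 2 parameter combinations is timed twice, once per `time_*` method, so the direct and `.agg` paths can be compared for the same key and value dtypes.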

In [4]: import pandas as pd
   ...: import numpy as np
   ...: animals = ['Dog', 'Cat']
   ...: days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
   ...: N = 1000000
   ...: df = pd.DataFrame({'animals': np.array(animals).take(np.random.randint(0, len(animals), size=N)),
   ...:                    'days': np.array(days).take(np.random.randint(0, len(days), size=N))})
   ...: df2 = df.copy()
   ...: df2['animals'] = df2['animals'].astype('category')
   ...: 
   ...: df3 = df2.copy()
   ...: df3['animals'] = df3['animals'].cat.codes
   ...: 
   ...: # group on object, aggregate object (df) vs. cat (df2)
   ...: print('groupby on object')
   ...: %timeit df.groupby('days').agg({'animals': 'first'})
   ...: %timeit df2.groupby('days').agg({'animals': 'first'})
   ...: 
   ...: 
   ...: # group on object (df) / cat (df2) / codes (df3), aggregate the grouping column
   ...: print('groupby on cat / codes / agg')
   ...: %timeit df.groupby('animals').agg({'animals': 'first'})
   ...: %timeit df2.groupby('animals').agg({'animals': 'first'})
   ...: %timeit df3.groupby('animals').agg({'animals': 'first'})
   ...: 
   ...: print('groupby on cat / codes / cython')
   ...: %timeit df2.groupby('animals').first()
   ...: %timeit df3.groupby('animals').first()
   ...: 
[1] groupby on object
270 ms +- 5.22 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)
118 ms +- 1.96 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
[2] groupby on cat / codes / agg
147 ms +- 2.53 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
69.1 ms +- 1.56 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
22.2 ms +- 838 us per loop (mean +- std. dev. of 7 runs, 10 loops each)
[3] groupby on cat / codes / cython
156 ms +- 4.32 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)
169 ms +- 4.8 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)
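
`%timeit` is only available inside IPython. As a rough sketch, some of the same measurements can be taken from a plain script with the standard-library `timeit` module. The helper `best_ms`, the case labels, and the reduced row count are assumptions made for illustration; `observed=True` is added only to avoid a warning on newer pandas, and the helper reports the best run rather than the mean and standard deviation that `%timeit` prints, so the numbers are not directly comparable with those above.

```python
import timeit

import numpy as np
import pandas as pd

N = 100_000  # smaller than the 1,000,000 rows used above, to keep the run short

animals = np.array(["Dog", "Cat"])
days = np.array(["Sunday", "Monday", "Tuesday", "Wednesday",
                 "Thursday", "Friday", "Saturday"])
df = pd.DataFrame({
    "animals": animals.take(np.random.randint(0, len(animals), size=N)),
    "days": days.take(np.random.randint(0, len(days), size=N)),
})
df2 = df.copy()
df2["animals"] = df2["animals"].astype("category")
df3 = df2.copy()
df3["animals"] = df3["animals"].cat.codes

# Hypothetical labels mapped to zero-argument callables to time.
cases = {
    "object key, object values, agg": lambda: df.groupby("days").agg({"animals": "first"}),
    "object key, cat values, agg": lambda: df2.groupby("days").agg({"animals": "first"}),
    "cat key, first()": lambda: df2.groupby("animals", observed=True).first(),
    "codes key, first()": lambda: df3.groupby("animals").first(),
}


def best_ms(func, repeat=3, number=3):
    """Best per-call time in milliseconds over `repeat` rounds of `number` calls."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number * 1e3


results = {label: best_ms(func) for label, func in cases.items()}
for label, ms in results.items():
    print(f"{label:35s} {ms:8.2f} ms")
```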