PERF/BENCHMARK: comprehensive cat-groupby benchmarks · Issue #19026 · pandas-dev/pandas
We have a number of groupby benchmarks with categoricals, but I think we need a comprehensive set that exercises combinations of:
groupby on cat/object columns
cython functions (e.g. first/max/...)
.agg variants of the cython functions
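The combinations listed above could be covered with an asv-style parameterized benchmark, the layout pandas uses for its benchmark suite. The following is only a rough sketch: the class name `CatGroupbySketch`, the parameter names, the helper `_reduce`, the reduced `N`, and the choice of `first`/`last` as the reductions are all assumptions made for illustration, not part of any existing benchmark.

```python
import numpy as np
import pandas as pd


class CatGroupbySketch:
    # Hypothetical benchmark class following the asv convention of
    # `params` / `param_names` / `setup` / `time_*` methods.
    params = [
        ["object", "category", "codes"],  # dtype of the grouping key
        ["first", "last"],                # illustrative subset of reductions
        ["cython", "agg"],                # direct method vs. .agg variant
    ]
    param_names = ["key_dtype", "func", "style"]

    def setup(self, key_dtype, func, style):
        N = 10_000  # assumption: kept small so the sketch runs quickly
        rng = np.random.default_rng(0)
        animals = np.array(["Dog", "Cat"])
        days = np.array(["Sunday", "Monday", "Tuesday", "Wednesday",
                         "Thursday", "Friday", "Saturday"])
        df = pd.DataFrame({
            "animals": animals.take(rng.integers(0, len(animals), size=N)),
            "days": days.take(rng.integers(0, len(days), size=N)),
        })
        if key_dtype == "category":
            df["animals"] = df["animals"].astype("category")
        elif key_dtype == "codes":
            df["animals"] = df["animals"].astype("category").cat.codes
        # `observed` is passed explicitly so the result does not depend on
        # the version-specific default for categorical keys.
        self.grouped = df.groupby("animals", observed=False)

    def _reduce(self, key_dtype, func, style):
        if style == "cython":
            # Direct method call, e.g. grouped.first()
            return getattr(self.grouped, func)()
        # .agg variant of the same reduction
        return self.grouped.agg({"days": func})

    def time_reduce(self, key_dtype, func, style):
        self._reduce(key_dtype, func, style)
```

asv would call `setup` once per parameter combination and time `time_reduce`, so every key dtype is paired with every reduction and call style without writing each case by hand.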
In [4]: import pandas as pd
...: import numpy as np
...: animals = ['Dog', 'Cat']
...: days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
...: N = 1000000
...: df = pd.DataFrame({'animals': np.array(animals).take(np.random.randint(0, len(animals), size=N)),
...:                    'days': np.array(days).take(np.random.randint(0, len(days), size=N))})
...: df2 = df.copy()
...: df2['animals'] = df2['animals'].astype('category')
...:
...: df3 = df2.copy()
...: df3['animals'] = df3['animals'].cat.codes
...:
...: # group on object; aggregate an object column (df) and a cat column (df2)
...: print('groupby on object')
...: %timeit df.groupby('days').agg({'animals': 'first'})
...: %timeit df2.groupby('days').agg({'animals': 'first'})
...:
...:
...: # group on object (df), cat (df2) and integer codes (df3); aggregate the grouping column
...: print('groupby on cat / codes / agg')
...: %timeit df.groupby('animals').agg({'animals': 'first'})
...: %timeit df2.groupby('animals').agg({'animals': 'first'})
...: %timeit df3.groupby('animals').agg({'animals': 'first'})
...:
...: print('groupby on cat / codes / cython')
...: %timeit df2.groupby('animals').first()
...: %timeit df3.groupby('animals').first()
...:
[1] groupby on object
270 ms +- 5.22 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)
118 ms +- 1.96 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
[2] groupby on cat / codes / agg
147 ms +- 2.53 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
69.1 ms +- 1.56 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
22.2 ms +- 838 us per loop (mean +- std. dev. of 7 runs, 10 loops each)
[3] groupby on cat / codes / cython
156 ms +- 4.32 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)
169 ms +- 4.8 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)
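To make the timings above easier to compare, the means can be normalized against the fastest entry in each section. This is a small sketch using only the numbers reported above. The mapping of each timing to `df`/`df2`/`df3` follows the order of the `%timeit` calls, and the names `reported_ms` and `relative_to_fastest` are made up for illustration.

```python
# Mean timings in milliseconds, copied from the output above.
# Keys are (printed section label, frame used in the %timeit call).
reported_ms = {
    ("groupby on object", "df"): 270.0,
    ("groupby on object", "df2"): 118.0,
    ("groupby on cat / codes / agg", "df"): 147.0,
    ("groupby on cat / codes / agg", "df2"): 69.1,
    ("groupby on cat / codes / agg", "df3"): 22.2,
    ("groupby on cat / codes / cython", "df2"): 156.0,
    ("groupby on cat / codes / cython", "df3"): 169.0,
}


def relative_to_fastest(section):
    """Return each frame's mean time as a multiple of the section's fastest."""
    rows = {frame: ms for (sec, frame), ms in reported_ms.items() if sec == section}
    fastest = min(rows.values())
    return {frame: round(ms / fastest, 2) for frame, ms in rows.items()}


# dict.fromkeys keeps the sections in their original order without repeats.
for section in dict.fromkeys(sec for sec, _ in reported_ms):
    print(section, relative_to_fastest(section))
```

In this single run, the `.agg` call on integer codes (22.2 ms) is the fastest of its section. The direct `.first()` calls (156 ms and 169 ms) are slower than the `.agg` calls on the same frames, although `.first()` reduces every non-key column while the `.agg` calls reduce only `animals`.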