PERF: improves performance in SeriesGroupBy.count by behzadnouri · Pull Request #10946 · pandas-dev/pandas

Before this change:

In [4]: ts
Out[4]:
a  1      0
   2      1
b  2      2
   NaN    3
c  1      4
   2      5
dtype: int64

In [5]: ts.count(level=1)
Out[5]:
1    2
2    4  # <<< BUG!
dtype: int64
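
The inflated count for label 2 is consistent with the row whose second-level label is NaN being folded into the last group. The PR text does not state the cause, so the following is only a pure-Python sketch of one plausible mechanism, not the pandas code: it assumes labels are encoded as integer codes with -1 marking a missing label (the convention pandas uses for factorized codes), and shows how indexing a counts array with -1 lands in the last slot. The function names are hypothetical.

```python
labels = [1, 2]                # distinct non-missing second-level labels
codes = [0, 1, 1, -1, 0, 1]    # per-row label codes; -1 marks the NaN label (assumed encoding)


def naive_counts(codes, n_labels):
    """Count rows per code without checking for the -1 sentinel."""
    counts = [0] * n_labels
    for code in codes:
        # A code of -1 indexes from the end, so it is added to the last label.
        counts[code] += 1
    return counts


def counts_with_missing_slot(codes, n_labels):
    """Count rows per code, routing the -1 sentinel to an extra final slot."""
    counts = [0] * (n_labels + 1)
    for code in codes:
        slot = n_labels if code == -1 else code
        counts[slot] += 1
    return counts


naive = naive_counts(codes, len(labels))
separate = counts_with_missing_slot(codes, len(labels))

print(dict(zip(labels, naive)))               # {1: 2, 2: 4}
print(dict(zip(labels + ["NaN"], separate)))  # {1: 2, 2: 3, 'NaN': 1}
```

The first result reproduces the numbers flagged as a bug above; the second keeps the missing label as its own group.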

In [6]: from string import ascii_lowercase

In [7]: np.random.seed(2718281)

In [8]: n = 1 << 21

In [9]: df = DataFrame({
   ...:     '1st':np.random.choice(list(ascii_lowercase), n),
   ...:     '2nd':np.random.randint(0, n // 100, n),
   ...:     '3rd':np.random.randn(n).round(3)})

In [10]: df.loc[np.random.choice(n, n // 10), '3rd'] = np.nan

In [11]: gr = df.groupby(['1st', '2nd'])['3rd']

In [12]: %timeit gr.count()
The slowest run took 6.67 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 86.4 ms per loop

In [13]: %timeit gr.count()
10 loops, best of 3: 87 ms per loop

With this change:

In [5]: ts.count(level=1)
Out[5]:
1      2
2      3
NaN    1
dtype: int64
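
This output can be cross-checked by hand. The following is a minimal pure-Python sketch, not pandas code: it counts non-missing values per label of the second index level and treats NaN as a label of its own. The rows mirror `ts` from the session; the helper names `is_missing` and `count_by_level` are hypothetical.

```python
import math
from collections import Counter

# (first level, second level, value) triples mirroring `ts`.
rows = [
    ("a", 1, 0),
    ("a", 2, 1),
    ("b", 2, 2),
    ("b", math.nan, 3),
    ("c", 1, 4),
    ("c", 2, 5),
]


def is_missing(x):
    """Return True for None or a float NaN."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def count_by_level(rows):
    """Count non-missing values per second-level label (hypothetical helper).

    NaN labels are collapsed onto the single key "NaN", because NaN != NaN
    and NaN is therefore unreliable as a dictionary key.
    """
    counts = Counter()
    for _, label, value in rows:
        key = "NaN" if is_missing(label) else label
        if not is_missing(value):
            counts[key] += 1
    return dict(counts)


result = count_by_level(rows)
print(result)  # {1: 2, 2: 3, 'NaN': 1}
```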

...

In [12]: %timeit gr.count()
The slowest run took 12.29 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 43.1 ms per loop

In [13]: %timeit gr.count()
10 loops, best of 3: 43.5 ms per loop