nunique performance for groupby with large number of groups · Issue #10820 · pandas-dev/pandas
It looks like `len(set)` beats both `len(np.unique)` and `pd.Series.nunique` if done naively -- here's an example with a large number of groups, where we compute the unique counts of one column while grouping by another:
```python
df = pd.DataFrame({'a': np.random.randint(10000, size=100000),
                   'b': np.random.randint(10, size=100000)})
g = df.groupby('a')

%timeit g.b.nunique()
1 loops, best of 3: 1 s per loop

%timeit g.b.apply(pd.Series.nunique)
1 loops, best of 3: 992 ms per loop

%timeit g.b.apply(lambda x: np.unique(x.values).size)
1 loops, best of 3: 652 ms per loop

%timeit g.b.apply(lambda x: len(set(x.values)))
1 loops, best of 3: 469 ms per loop
```
The fastest way I know to accomplish the same thing is this:
```python
g = df.groupby(['a', 'b'])

%timeit g.b.first().groupby(level=0).size()
100 loops, best of 3: 6.2 ms per loop
```
... which is a LOT faster. I wonder if something similar could be done inside `GroupBy.nunique`, since it's quite a common use case?
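A minimal sketch (with smaller, seeded data than the benchmark above, just as an assumed example) checking that the two-level-groupby trick yields the same per-group counts as `nunique`: grouping by both columns makes each distinct `(a, b)` pair a single group, so counting those pairs per `a` at level 0 equals the number of unique `b` values in each `a` group.

```python
import numpy as np
import pandas as pd

# Smaller version of the data above, seeded for reproducibility.
rng = np.random.RandomState(0)
df = pd.DataFrame({'a': rng.randint(100, size=1000),
                   'b': rng.randint(10, size=1000)})

# Slow path: one Python-level nunique call per group.
slow = df.groupby('a').b.nunique()

# Fast path: after first(), each (a, b) pair appears exactly once,
# so counting entries per 'a' (level 0) counts the distinct b values.
fast = df.groupby(['a', 'b']).b.first().groupby(level=0).size()

assert (slow.values == fast.values).all()
assert list(slow.index) == list(fast.index)
```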