nunique performance for groupby with large number of groups · Issue #10820 · pandas-dev/pandas

It looks like len(set) beats both len(np.unique) and pd.Series.nunique when applied naively per group -- here's an example with a large number of groups, where we compute the unique count of one column while grouping by another:

df = pd.DataFrame({'a': np.random.randint(10000, size=100000),
                   'b': np.random.randint(10, size=100000)})
g = df.groupby('a')

%timeit g.b.nunique()
1 loops, best of 3: 1 s per loop

%timeit g.b.apply(pd.Series.nunique)
1 loops, best of 3: 992 ms per loop

%timeit g.b.apply(lambda x: np.unique(x.values).size)
1 loops, best of 3: 652 ms per loop

%timeit g.b.apply(lambda x: len(set(x.values)))
1 loops, best of 3: 469 ms per loop
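For running the comparison outside IPython, here's a plain-Python version of the benchmark above; the random seed and the crude one-shot timer are my additions (use %timeit for stable numbers):

```python
import time
import numpy as np
import pandas as pd

# Same setup as above (seed added here for reproducibility).
rng = np.random.RandomState(0)
df = pd.DataFrame({'a': rng.randint(10000, size=100000),
                   'b': rng.randint(10, size=100000)})
g = df.groupby('a')

def bench(label, fn):
    # Crude single-run timer, good enough to show the relative ordering.
    t0 = time.perf_counter()
    fn()
    print('%-35s %.1f ms' % (label, 1000 * (time.perf_counter() - t0)))

bench('g.b.nunique()', lambda: g.b.nunique())
bench('apply(pd.Series.nunique)', lambda: g.b.apply(pd.Series.nunique))
bench('apply(np.unique(...).size)', lambda: g.b.apply(lambda x: np.unique(x.values).size))
bench('apply(len(set(...)))', lambda: g.b.apply(lambda x: len(set(x.values))))
```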

The fastest way I know to accomplish the same thing is this:

g = df.groupby(['a', 'b'])

%timeit g.b.first().groupby(level=0).size()
100 loops, best of 3: 6.2 ms per loop

... which is apparently a LOT faster -- roughly 150x on this example.
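For anyone wanting to adopt the trick, here's a sketch checking that it gives the same answer as nunique; the drop_duplicates variant is my own equivalent formulation, not one of the timed snippets above:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # seed added for reproducibility
df = pd.DataFrame({'a': rng.randint(10000, size=100000),
                   'b': rng.randint(10, size=100000)})

# Slow path: one Python-level call per group.
slow = df.groupby('a').b.nunique()

# Fast path from above: group on both columns, take one row per (a, b)
# pair, then count those rows per 'a'.
fast = df.groupby(['a', 'b']).b.first().groupby(level=0).size()

# Equivalent formulation via drop_duplicates (my variant, untimed).
alt = df.drop_duplicates(['a', 'b']).groupby('a').size()

assert slow.equals(fast) and slow.equals(alt)
```

Both fast variants rely on the same idea: deduplicating (a, b) pairs once, vectorized, instead of calling a Python function per group.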

I wonder if something similar could be done inside GroupBy.nunique, since it's quite a common use case?