PERF: don't sort data twice in groupby apply when not using libreduction fast_apply by jorisvandenbossche · Pull Request #40176 · pandas-dev/pandas
See #40171 (comment) for context. I noticed that we were calling splitter._get_sorted_data() twice when using the non-fast_apply fallback.
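To illustrate the pattern, here is a minimal pure-Python sketch. It is not the pandas implementation: `ToySplitter`, `get_sorted_data`, `iter_groups`, and the two `apply_*` functions are hypothetical names invented for this example. It only shows why sorting once and reusing the result gives the same output with half the sorting work.

```python
from itertools import groupby


class ToySplitter:
    """Hypothetical stand-in for a groupby splitter; not the pandas class."""

    def __init__(self, values, labels):
        self.values = list(values)
        self.labels = list(labels)
        self.sort_calls = 0  # counts how often the expensive sort runs

    def get_sorted_data(self):
        # Expensive step: order (label, value) pairs so each group is contiguous.
        self.sort_calls += 1
        return sorted(zip(self.labels, self.values), key=lambda pair: pair[0])


def iter_groups(sorted_pairs):
    # Yield the values of one group at a time from already-sorted pairs.
    for _, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield [value for _, value in group]


def apply_sorting_twice(splitter, func):
    # Wasteful pattern: sort up front, then sort again for the fallback loop.
    splitter.get_sorted_data()
    sorted_pairs = splitter.get_sorted_data()
    return [func(group) for group in iter_groups(sorted_pairs)]


def apply_sorting_once(splitter, func):
    # Sort a single time and iterate over that result.
    sorted_pairs = splitter.get_sorted_data()
    return [func(group) for group in iter_groups(sorted_pairs)]


values, labels = [1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0]
twice = ToySplitter(values, labels)
once = ToySplitter(values, labels)
result_twice = apply_sorting_twice(twice, sum)
result_once = apply_sorting_once(once, sum)
print(result_twice == result_once, twice.sort_calls, once.sort_calls)
```

Both variants return the same per-group results; only the number of sort calls differs, and that is the cost that grows with the size of the data.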
Using the benchmark case from groupby.Apply.time_scalar_function_single/multi_col (as in #40171 (comment)), but with bigger data (10 ** 6 rows instead of 10 ** 4):
import numpy as np
from pandas import DataFrame

N = 10 ** 6
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = DataFrame(
    {
        "key": labels,
        "key2": labels2,
        "value1": np.random.randn(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)
# Convert to an ArrayManager-backed frame (private API)
df_am = df._as_manager("array")
In [2]: %timeit df_am.groupby("key").apply(lambda x: 1)
252 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- master
166 ms ± 5.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) <-- PR
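For a rough sense of scale, the two mean timings above can be turned into a speedup factor. This is plain arithmetic on the reported means; it ignores the reported standard deviations, and the variable names are made up for this sketch.

```python
# Mean timings reported above, in milliseconds.
master_ms = 252.0
pr_ms = 166.0

speedup = master_ms / pr_ms        # how many times faster the PR is
reduction = 1 - pr_ms / master_ms  # fraction of runtime saved

print(f"{speedup:.2f}x faster, {reduction:.0%} less time")
# prints "1.52x faster, 34% less time"
```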