REGR: groupby.transform with a UDF performance · Issue #55256 · pandas-dev/pandas

import numpy as np
import pandas as pd

pd.options.mode.copy_on_write = False  # set to True for the CoW timings
size = 10_000
df = pd.DataFrame(
    {
        'a': np.random.randint(0, 100, size),
        'b': np.random.randint(0, 100, size),
        'c': np.random.randint(0, 100, size),
    }
).set_index(['a', 'b']).sort_index()

gb = df.groupby(['a', 'b'])

%timeit gb.transform(lambda x: x == x.shift(-1).fillna(0))

# 2.0.x - CoW=False
# 1.46 s ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 
# 2.0.x - CoW=True
# 1.47 s ± 6.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 
# main - CoW=False
# 4.35 s ± 50.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 
# main - CoW=True
# 9.11 s ± 76.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I encountered this while updating some code to use Copy-on-Write (CoW). The regression exists without CoW (roughly 3x slower on main than on 2.0.x) and is worse with it (roughly 6x slower). I haven't yet investigated why.
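For anyone reproducing this outside IPython, where `%timeit` is unavailable, here is a rough sketch of the same benchmark as a plain function. The name `time_transform`, the fixed seed, and the `repeats` parameter are my own additions and not part of the report above. It assumes a pandas version where `mode.copy_on_write` can still be toggled, and it reports the best of a few runs rather than `%timeit`'s mean ± std. dev., so the numbers will not be directly comparable.

```python
import time

import numpy as np
import pandas as pd


def time_transform(size=10_000, copy_on_write=False, repeats=3):
    """Return the best wall-clock time in seconds for the UDF transform.

    Hypothetical helper: mirrors the snippet in the report, but uses
    time.perf_counter instead of the IPython %timeit magic.
    """
    # Fixed seed is an addition for repeatability; the report used np.random.randint.
    rng = np.random.default_rng(0)
    timings = []
    # Assumes the "mode.copy_on_write" option is still settable in this pandas version.
    with pd.option_context("mode.copy_on_write", copy_on_write):
        df = pd.DataFrame(
            {
                "a": rng.integers(0, 100, size),
                "b": rng.integers(0, 100, size),
                "c": rng.integers(0, 100, size),
            }
        ).set_index(["a", "b"]).sort_index()
        gb = df.groupby(["a", "b"])
        for _ in range(repeats):
            start = time.perf_counter()
            gb.transform(lambda x: x == x.shift(-1).fillna(0))
            timings.append(time.perf_counter() - start)
    return min(timings)
```

Calling `time_transform(copy_on_write=False)` and `time_transform(copy_on_write=True)` on both 2.0.x and main should show whether the slowdown reproduces in a given environment; printing `pd.__version__` alongside makes the results easier to compare.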