transform function is very slow in 0.14 - similar to issue #2121 · Issue #7383 · pandas-dev/pandas
I have a dataframe where I'm computing some summary items for groups, and it's convenient to merge those back into the parent dataframe. transform seems the most intuitive way to do this, but I'm seeing much worse performance than doing a manual aggregation and merge.
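For concreteness, here is a toy sketch of the two patterns being compared (the small frame and the x_max name are illustrative, not from the issue):

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1.0, 2.0, 3.0]})

# Pattern 1: transform broadcasts the per-group aggregate back to every row
per_row = df.groupby('g')['x'].transform(max)

# Pattern 2: aggregate once, then merge the result back onto the parent frame
agg = df.groupby('g')['x'].max().to_frame('x_max')
merged = df.merge(agg, left_on='g', right_index=True)
```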
Starting from issue #2121, which is marked as closed:
```python
import numpy as np
import pandas as pd
from pandas import Index, MultiIndex, DataFrame

n_dates = 20000
n_securities = 50
n_columns = 2
share_na = 0.1

# business-day dates encoded as YYYYMMDD integers
dates = pd.date_range('1997-12-31', periods=n_dates, freq='B')
dates = Index(map(lambda x: x.year * 10000 + x.month * 100 + x.day, dates))

secid_min = int('10000000', 16)
secid_max = int('F0000000', 16)
step = (secid_max - secid_min) // (n_securities - 1)
security_ids = map(lambda x: hex(x)[2:10].upper(), range(secid_min, secid_max + 1, step))

# single-level MultiIndex: each date repeated once per security
data_index = MultiIndex(levels=[dates.values],
                        labels=[[i for i in xrange(n_dates) for _ in xrange(n_securities)]],
                        names=['date'])
n_data = len(data_index)

columns = Index(['factor{}'.format(i) for i in xrange(1, n_columns + 1)])

data = DataFrame(np.random.randn(n_data, n_columns),
                 index=data_index,
                 columns=columns)

grouped = data.groupby(level='date')
```
which gives us a (1000000, 2) DataFrame of floats.
```
%timeit grouped.transform(max)
1 loops, best of 3: 8.16 s per loop
```
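As a hedged aside (not part of the original report): passing the builtin max makes pandas call a Python function once per group, whereas passing the method name as a string should let groupby dispatch to its cythonized max, which is usually far faster:

```python
# assumption: string dispatch ('max') takes pandas' cython fast path
# instead of invoking the Python builtin once per group
%timeit grouped.transform('max')
```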
My alternative is to write something like this:
```python
def mergeTransform(grouped, originalDF):
    # aggregate once: one row per group
    item = grouped.max()
    colnames = item.columns
    # outer-merge the aggregate back onto the parent frame, then keep only
    # the newly merged columns (the last item.shape[1] ones) and restore
    # the original column names
    item = originalDF.merge(item, how='outer',
                            left_index=True, right_index=True).ix[:, -item.shape[1]:]
    item.columns = colnames
    return item
```
which seems to run much faster despite all the indexing going on:
```
%%timeit
mergeTransform(grouped, data)

10 loops, best of 3: 81.1 ms per loop
```
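For comparison, a merge-free variant of the same idea is possible (reindexTransform is my name, a sketch rather than anything from the issue): compute the aggregate once, then repeat its rows by reindexing on the date level of the parent index.

```python
def reindexTransform(grouped, originalDF):
    # per-group aggregate: one row per date
    agg = grouped.max()
    # repeat each aggregate row once per security by reindexing on the
    # (repeated) 'date' level of the parent index, then restore that index
    out = agg.reindex(originalDF.index.get_level_values('date'))
    out.index = originalDF.index
    return out

result = reindexTransform(grouped, data)
```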