Implement high performance rolling_rank

xref SO issue here

I'm looking to set the rolling rank on a dataframe. Having posted, discussed and analysed the code, it looks like the suggested way would be to use the pandas Series.rank function as an argument in rolling_apply. However, on large datasets the performance is particularly poor. I have tried different implementations, and using bottleneck's rank method is roughly an order of magnitude faster, but that only offers the average option for ties. It is also still some way off the performance of rolling_mean. I have previously implemented a rolling rank function which monitors changes on a moving window (in a similar way to algos.roll_mean, I believe) rather than recalculating the rank from scratch on each window. Below is an example to highlight the performance; it should be possible to implement a rolling rank with comparable performance to rolling_mean.

python: 2.7.3
pandas: 0.15.2
scipy: 0.10.1
bottleneck: 0.7.0

```python
import numpy as np
import pandas as pd
import scipy.stats as sc  # alias inferred from sc.rankdata below
import bottleneck as bd   # alias inferred from bd.rankdata below


def rollingRankOnSeries(array):
    s = pd.Series(array)
    return s.rank(method='min', ascending=False)[len(s) - 1]


def rollingRankSciPy(array):
    return array.size + 1 - sc.rankdata(array)[-1]


def rollingRankBottleneck(array):
    return array.size + 1 - bd.rankdata(array)[-1]


def rollingRankArgSort(array):
    return array.size - array.argsort().argsort()[-1]


rollWindow = 240
df = pd.DataFrame(np.random.randn(100000, 4), columns=list('ABCD'),
                  index=pd.date_range('1/1/2000', periods=100000, freq='1H'))
# Force a tie near the end of the series (single .loc call avoids chained assignment)
df.loc[df.index[-3:-1], 'A'] = 7.5
df.loc[df.index[-1], 'A'] = 5.5

# 28.9 secs (allows competition/min ranking for ties)
df["SER_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankOnSeries)

# 70.89 secs (allows competition/min ranking for ties)
df["SCIPY_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankSciPy)

# 3.64 secs (only provides average ranking for ties)
df["BNECK_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankBottleneck)

# 3.56 secs (only provides competition/min ranking for ties, not necessarily correct result)
df["ASRT_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankArgSort)

# 0.008 secs
df["MEAN"] = pd.rolling_mean(df['A'], window=rollWindow)
```

I think this is likely to be a common request from users looking to use pandas for analysis on large datasets, and it would be a useful addition to the pandas moving statistics/moments suite.
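
To make the moving-window idea concrete, here is a minimal pure-Python sketch of what I mean by monitoring changes rather than re-ranking each window. It is not my original implementation and not a pandas API: the name `rolling_rank_sketch` and its `method` argument are hypothetical, and it assumes the input contains no NaNs. It keeps the window contents sorted, removes the value that leaves, inserts the value that arrives, and reads the descending rank of the newest value from its position.

```python
from bisect import bisect_left, bisect_right, insort

import numpy as np


def rolling_rank_sketch(values, window, method="min"):
    """Descending rank of the newest value in each trailing window.

    Hypothetical helper for illustration only; assumes no NaNs in `values`.
    """
    if method not in ("min", "average"):
        raise ValueError("method must be 'min' or 'average'")
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape, np.nan)
    sorted_win = []  # current window contents, kept in ascending order
    for i, x in enumerate(values):
        if i >= window:
            # Remove the value that just dropped out of the window.
            old = values[i - window]
            del sorted_win[bisect_left(sorted_win, old)]
        insort(sorted_win, x)
        if i >= window - 1:
            lo = bisect_left(sorted_win, x)
            hi = bisect_right(sorted_win, x)
            n_greater = len(sorted_win) - hi  # values strictly above x
            n_equal = hi - lo                 # ties, including x itself
            if method == "min":
                out[i] = n_greater + 1
            else:
                out[i] = n_greater + (n_equal + 1) / 2.0
    return out


ranks = rolling_rank_sketch([3.0, 1.0, 3.0, 2.0, 3.0], window=3)
```

Both tie rules fall out of the same sorted window, so the choice of method costs nothing extra. Being pure Python, this only shows the shape of the algorithm; the list insert and delete still shift elements, so a compiled version would want a structure better suited to that.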