PERF: .median(axis=1) perf issues · Issue #16468 · pandas-dev/pandas
In [1]: import numpy as np; import pandas as pd
In [2]: df = pd.DataFrame(np.random.randn(10000, 2), columns=list('AB'))
In [3]: result1 = df.median(1)
In [4]: result2 = pd.Series(np.nanmedian(df.values, axis=1), index=df.index)
In [5]: result1.equals(result2)
Out[5]: True
In [6]: %timeit result1 = df.median(1)
241 µs ± 4.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: %timeit df.median(1)
250 µs ± 5.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: %timeit pd.Series(np.nanmedian(df.values, axis=1), index=df.index)
1.77 ms ± 32.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: pd.set_option('use_bottleneck', False)
In [10]: result3 = df.median(1)
In [11]: result1.equals(result3)
Out[11]: True
In [12]: %timeit df.median(1)
317 ms ± 9.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, if bottleneck is installed, df.median(1) is fast (about 250 µs in the timings above). However, if it is not installed (or not used), we fall back to np.apply_along_axis(our_median_impl), which takes about 317 ms on the same frame. Our median impl is pretty fast itself, but it only handles 1-d input, so this ends up as a Python-level loop over the rows.
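To illustrate why the fallback is slow, here is a minimal sketch showing that np.apply_along_axis invokes the supplied function once per 1-d slice. The name counting_median is hypothetical and stands in for the pandas-internal 1-d median; it is not the actual implementation.

```python
import numpy as np

# Hypothetical stand-in for a 1-d median routine; it counts its own calls.
calls = {"n": 0}

def counting_median(row):
    calls["n"] += 1
    return np.nanmedian(row)

rng = np.random.RandomState(0)
values = rng.randn(1000, 2)

# axis=1 hands each row to counting_median, one Python call per row.
looped = np.apply_along_axis(counting_median, 1, values)

# The vectorized call produces the same result without the per-row calls.
vectorized = np.nanmedian(values, axis=1)

print(calls["n"], np.allclose(looped, vectorized))
```

With 10000 rows, as in the session above, that is 10000 Python-level function calls, and the per-call overhead dominates the actual median work on two values.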
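To fix, we can use np.nanmedian when it is available (it was added in numpy 1.9; we currently support >= 1.7). In the timings above it took about 1.77 ms: slower than bottleneck, but far faster than the apply_along_axis fallback.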
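One possible shape for that fix is sketched below, under stated assumptions: row_median and _median_1d are hypothetical names rather than pandas internals, and the availability check is a plain hasattr test instead of whatever version gate pandas would actually use.

```python
import numpy as np

def _median_1d(values):
    # Hypothetical 1-d fallback: drop NaNs, then take the ordinary median.
    valid = values[~np.isnan(values)]
    if valid.size == 0:
        return np.nan
    return np.median(valid)

def row_median(values, axis=1):
    # Prefer the vectorized np.nanmedian when this numpy provides it.
    if hasattr(np, "nanmedian"):
        return np.nanmedian(values, axis=axis)
    # Otherwise fall back to the slow per-slice Python loop.
    return np.apply_along_axis(_median_1d, axis, values)

rng = np.random.RandomState(0)
arr = rng.randn(100, 3)
arr[::7, 0] = np.nan  # sprinkle in some missing values

fast = row_median(arr, axis=1)
slow = np.apply_along_axis(_median_1d, 1, arr)
print(np.allclose(fast, slow, equal_nan=True))
```

The fallback branch keeps behaviour unchanged on older numpy, so only installations that already ship np.nanmedian take the faster path.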