PERF: improves merge performance when key space exceeds i8 bounds by behzadnouri · Pull Request #9151 · pandas-dev/pandas
In join operations, current master switches to a less efficient path when the key space exceeds int64
bounds. This commit improves both performance and memory usage in that case:
on master:
In [1]: np.random.seed(2718281)
In [2]: left = DataFrame(np.random.randint(-1 << 10, 1 << 10, (1 << 20, 8)),
...: columns=list('ABCDEFG') + ['left'])
In [3]: i = np.random.permutation(len(left))
In [4]: right = left.iloc[i].copy()
In [5]: right.columns = right.columns[:-1].tolist() + ['right']
In [6]: %timeit pd.merge(left, right, how='outer')
1 loops, best of 3: 13.8 s per loop
In [7]: %memit pd.merge(left, right, how='outer')
peak memory: 1064.16 MiB, increment: 820.65 MiB
on branch:
In [6]: %timeit pd.merge(left, right, how='outer')
1 loops, best of 3: 1.42 s per loop
In [7]: %memit pd.merge(left, right, how='outer')
peak memory: 440.72 MiB, increment: 199.89 MiB
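A rough sketch of why the example above lands on the slow path on master. The merge uses the seven shared columns A..G as keys, each holding integers in [-1024, 1024). The sketch assumes the "key space" is the product of the per-column distinct-value counts; that is an interpretation of the description above, not a statement about pandas internals, and the variable names are illustrative only.

```python
import numpy as np

# Assumption: key space = product of the number of distinct values per key column.
n_keys = 7  # shared columns A..G in the example above

# randint(-1 << 10, 1 << 10, ...) draws from [-1024, 1024): at most 2048 distinct values.
max_unique_per_key = (1 << 10) - (-1 << 10)

# Python ints are arbitrary precision, so this product cannot overflow here.
key_space = max_unique_per_key ** n_keys

int64_max = np.iinfo(np.int64).max

# True when the combined key cannot be packed into a single int64 label.
exceeds_i8 = key_space > int64_max
print(exceeds_i8)
```

Under that assumption the upper bound on the key space is 2048**7 = 2**77, well beyond the int64 maximum of 2**63 - 1, which matches the case this PR targets.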
join|merge
benchmarks:
-------------------------------------------------------------------------------
Test name | head[ms] | base[ms] | ratio |
-------------------------------------------------------------------------------
i8merge | 1510.8590 | 13001.0207 | 0.1162 |
merge_2intkey_nosort | 20.9847 | 27.0010 | 0.7772 |
join_dataframe_index_single_key_small | 16.4944 | 19.0903 | 0.8640 |
join_dataframe_integer_2key | 7.3093 | 8.0273 | 0.9106 |
merge_2intkey_sort | 60.6734 | 61.8456 | 0.9810 |
left_outer_join_index | 3165.2357 | 3214.8040 | 0.9846 |
join_dataframe_index_multi | 36.0967 | 36.5300 | 0.9881 |
join_dataframe_index_single_key_bigger_sort | 24.7893 | 25.0640 | 0.9890 |
join_non_unique_equal | 0.9350 | 0.9403 | 0.9943 |
join_dataframe_index_single_key_bigger | 24.8820 | 24.9500 | 0.9973 |
strings_join_split | 57.1507 | 56.9630 | 1.0033 |
join_dataframe_integer_key | 2.9247 | 2.8093 | 1.0411 |
-------------------------------------------------------------------------------
Test name | head[ms] | base[ms] | ratio |
-------------------------------------------------------------------------------
Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234
Target [f2d9e17] : improves merge performance when key space exceeds i8 bounds
Base [def58c9] : Merge pull request #9128 from hsperr/expanduser
ENH: Expanduser in to_file methods GH9066
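As a quick sanity check on the table above, the ratio column appears to be head[ms] divided by base[ms]. The sketch below recomputes it for the i8merge row using the figures reported there. The variable names are illustrative and not part of the benchmark tooling.

```python
# Timings for the i8merge row, copied from the benchmark table (milliseconds).
head_ms = 1510.8590   # target commit
base_ms = 13001.0207  # baseline commit

# Assumption: ratio = head / base, so values below 1.0 favour the target commit.
ratio = head_ms / base_ms

# The inverse expresses the same result as a speedup factor.
speedup = base_ms / head_ms

print(f"ratio={ratio:.4f} speedup={speedup:.1f}x")
```

For this row the recomputed ratio rounds to the reported 0.1162, which corresponds to a speedup of roughly 8.6x.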