PERF: improves merge performance when key space exceeds i8 bounds by behzadnouri · Pull Request #9151 · pandas-dev/pandas

In join operations, current master switches to a less efficient path if the key space exceeds int64 bounds. This commit improves performance and memory usage:

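on master (the session assumes `import numpy as np`, `import pandas as pd`, `from pandas import DataFrame`, and `%load_ext memory_profiler` for `%memit`):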

In [1]: np.random.seed(2718281)

In [2]: left = DataFrame(np.random.randint(-1 << 10, 1 << 10, (1 << 20, 8)),
   ...:                  columns=list('ABCDEFG') + ['left'])

In [3]: i = np.random.permutation(len(left))

In [4]: right = left.iloc[i].copy()

In [5]: right.columns = right.columns[:-1].tolist() + ['right']

In [6]: %timeit pd.merge(left, right, how='outer')
1 loops, best of 3: 13.8 s per loop

In [7]: %memit pd.merge(left, right, how='outer')
peak memory: 1064.16 MiB, increment: 820.65 MiB

on branch:

In [6]: %timeit pd.merge(left, right, how='outer')
1 loops, best of 3: 1.42 s per loop

In [7]: %memit pd.merge(left, right, how='outer')
peak memory: 440.72 MiB, increment: 199.89 MiB
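
As a rough check of why this benchmark lands on the slow path, the arithmetic below estimates the size of the key space. It is a sketch under two assumptions that the description does not state explicitly: that the merge keys are the seven shared columns `A` to `G` (the default for `pd.merge` when `on` is not given), and that "key space" means the product of the per-column cardinalities. The constant names are illustrative only.

```python
# Sketch only: "key space" is taken here to mean the product of per-column
# cardinalities, which is an assumption about how the int64 limit is reached.

N_KEY_COLUMNS = 7                 # shared columns 'A'..'G' in the benchmark
LOW, HIGH = -1 << 10, 1 << 10     # randint bounds used above; HIGH is exclusive

distinct_per_column = HIGH - LOW  # at most 2048 distinct values per column
key_space = distinct_per_column ** N_KEY_COLUMNS
INT64_MAX = 2**63 - 1

exceeds_i8 = key_space > INT64_MAX
print(f"key space = 2**{key_space.bit_length() - 1}, exceeds int64: {exceeds_i8}")
```

With up to 2048 values in each of seven columns, the combined key cannot be packed into a single int64 without overflow, which is the situation the title refers to.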

join|merge benchmarks:

-----------------------------------------------------------------------------------
Test name                                    |   head[ms] |   base[ms] |  ratio   |
-----------------------------------------------------------------------------------
i8merge                                      |  1510.8590 | 13001.0207 |   0.1162 |
merge_2intkey_nosort                         |    20.9847 |    27.0010 |   0.7772 |
join_dataframe_index_single_key_small        |    16.4944 |    19.0903 |   0.8640 |
join_dataframe_integer_2key                  |     7.3093 |     8.0273 |   0.9106 |
merge_2intkey_sort                           |    60.6734 |    61.8456 |   0.9810 |
left_outer_join_index                        |  3165.2357 |  3214.8040 |   0.9846 |
join_dataframe_index_multi                   |    36.0967 |    36.5300 |   0.9881 |
join_dataframe_index_single_key_bigger_sort  |    24.7893 |    25.0640 |   0.9890 |
join_non_unique_equal                        |     0.9350 |     0.9403 |   0.9943 |
join_dataframe_index_single_key_bigger       |    24.8820 |    24.9500 |   0.9973 |
strings_join_split                           |    57.1507 |    56.9630 |   1.0033 |
join_dataframe_integer_key                   |     2.9247 |     2.8093 |   1.0411 |
-----------------------------------------------------------------------------------
Test name                                    |   head[ms] |   base[ms] |  ratio   |
-----------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [f2d9e17] : improves merge performance when key space exceeds i8 bounds
Base   [def58c9] : Merge pull request #9128 from hsperr/expanduser

ENH: Expanduser in to_file methods GH9066
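
For reading the benchmark table, the ratio column appears to be `head[ms] / base[ms]`; the rows in the table are consistent with that reading. The short sketch below recomputes it for two rows copied from the table. The dictionary and function names are illustrative and are not part of the benchmark tooling.

```python
# (head_ms, base_ms) pairs copied from the benchmark table in this description.
timings = {
    "i8merge": (1510.8590, 13001.0207),
    "merge_2intkey_nosort": (20.9847, 27.0010),
}


def ratio(head_ms: float, base_ms: float) -> float:
    """Target time divided by baseline time; below 1.0 means the target is faster."""
    return head_ms / base_ms


ratios = {name: ratio(head, base) for name, (head, base) in timings.items()}

for name, value in ratios.items():
    # The inverse of the ratio is the speedup factor of the target commit.
    print(f"{name}: ratio={value:.4f}, speedup={1 / value:.2f}x")
```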