Implement hash_join for merges by phofl · Pull Request #57970 · pandas-dev/pandas (original) (raw)

Our abstraction in merges is bad, this makes it a little worse unfortunately. But it enables a potentially huge performance improvement for joins that could be hash joins. I am using "right" to make the decision, because left determines the result order, which means that we would have to sort after we are finished which gives the performance improvement back. using right makes this problem go away.

We get time complexity O(m+n) here over O(m*n) with a non-trivial factor as before

This makes adding semi joins pretty easy as well, which is nice next to the performance improvements here.

| Change   | Before [38086f11] <backtest>   | After [3b6b787e] <to>   |   Ratio | Benchmark (Parameter)                                                       |
|----------|--------------------------------|-------------------------|---------|-----------------------------------------------------------------------------|
| -        | 193±4ms                        | 165±5ms                 |    0.85 | join_merge.I8Merge.time_i8merge('inner')                                    |
| -        | 7.76±0.2ms                     | 6.53±0.06ms             |    0.84 | join_merge.Merge.time_merge_2intkey(False)                                  |
| -        | 1.03±0.02ms                    | 834±4μs                 |    0.81 | join_merge.Merge.time_merge_dataframe_integer_2key(False)                   |
| -        | 485±2μs                        | 339±5μs                 |    0.7  | join_merge.Merge.time_merge_dataframe_integer_key(False)                    |
| -        | 3.29±0.2ms                     | 2.29±0.07ms             |    0.7  | join_merge.MergeDatetime.time_merge(('ms', 'ms'), 'Europe/Brussels', False) |
| -        | 3.31±0.07ms                    | 2.27±0.07ms             |    0.69 | join_merge.MergeDatetime.time_merge(('ms', 'ms'), None, False)              |
| -        | 2.79±0.09ms                    | 1.77±0.01ms             |    0.63 | join_merge.MergeDatetime.time_merge(('ns', 'ms'), 'Europe/Brussels', False) |
| -        | 2.89±0.04ms                    | 1.78±0.05ms             |    0.62 | join_merge.MergeDatetime.time_merge(('ns', 'ms'), None, False)              |
| -        | 2.57±0.09ms                    | 1.56±0.03ms             |    0.61 | join_merge.MergeDatetime.time_merge(('ns', 'ns'), None, False)              |
| -        | 1.97±0.05ms                    | 1.18±0.02ms             |    0.6  | join_merge.MergeEA.time_merge('Float32', False)                             |
| -        | 1.84±0.02ms                    | 1.10±0.03ms             |    0.6  | join_merge.MergeEA.time_merge('UInt16', False)                              |
| -        | 2.09±0.04ms                    | 1.24±0.01ms             |    0.59 | join_merge.MergeEA.time_merge('UInt64', False)                              |
| -        | 2.10±0.09ms                    | 1.22±0.01ms             |    0.58 | join_merge.MergeEA.time_merge('Float64', False)                             |
| -        | 2.09±0.08ms                    | 1.22±0.01ms             |    0.58 | join_merge.MergeEA.time_merge('UInt32', False)                              |
| -        | 2.70±0.1ms                     | 1.54±0.02ms             |    0.57 | join_merge.MergeDatetime.time_merge(('ns', 'ns'), 'Europe/Brussels', False) |
| -        | 1.72±0.02ms                    | 971±20μs                |    0.57 | join_merge.MergeEA.time_merge('Int16', False)                               |
| -        | 1.76±0.03ms                    | 973±10μs                |    0.55 | join_merge.MergeEA.time_merge('Int32', False)                               |
| -        | 1.94±0.09ms                    | 1.07±0.03ms             |    0.55 | join_merge.MergeEA.time_merge('Int64', False)                               |
| -        | 57.0±2ms                       | 21.7±0.3ms              |    0.38 | join_merge.UniqueMerge.time_unique_merge(1000000)                           |
| -        | 106±7ms                        | 34.1±0.3ms              |    0.32 | join_merge.UniqueMerge.time_unique_merge(4000000)                           |