BUG: bug in left join on multi-index with sort=True or nulls by behzadnouri · Pull Request #9210 · pandas-dev/pandas (original) (raw)

on master:

In [8]: left
Out[8]:
  1st 2nd  3rd
0   c   c   13
1   b   b   79
2   a   a   27
3   b   b   27
4   c   a   86

In [9]: right
Out[9]:
         4th
1st 2nd
c   a    -86
b   b    -79
c   c    -13
b   b    -27
a   a    -27

sort=True is ignored, and the result is not sorted by the join key:

In [10]: left.join(right, on=['1st', '2nd'], how='left', sort=True)
Out[10]:
  1st 2nd  3rd  4th
0   c   c   13  -13
1   b   b   79  -79
1   b   b   79  -27
2   a   a   27  -27
3   b   b   27  -79
3   b   b   27  -27
4   c   a   86  -86

in addition:

In [44]: left
Out[44]:
   1st  2nd  3rd
0  NaN    a   14
1    a  NaN   10
2    a    b   19
3  NaN  NaN   62
4    a    c   90

In [45]: right
Out[45]:
         4th
1st 2nd
NaN a    -14
a   c    -90
    NaN  -10
    b    -19
NaN NaN  -62

this works:

In [46]: merge(left, right.reset_index(), on=['1st', '2nd'], how='left')
Out[46]:
   1st  2nd  3rd  4th
0  NaN    a   14  -14
1    a  NaN   10  -10
2    a    b   19  -19
3  NaN  NaN   62  -62
4    a    c   90  -90

but this does not:

In [47]: left.join(right, on=['1st', '2nd'], how='left')
Out[47]:
   1st  2nd  3rd  4th
0  NaN    a   14  NaN
1    a  NaN   10  NaN
2    a    b   19  -19
3  NaN  NaN   62  NaN
4    a    c   90  -90

also, get_group_index called in these lines is subject to overflow, and should be avoided.

r 'join|merge' benchmarks:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
join_dataframe_index_multi                   |  35.4387 |  36.5883 |   0.9686 |
join_dataframe_index_single_key_bigger_sort  |  24.7660 |  24.8604 |   0.9962 |
strings_join_split                           |  57.6473 |  57.6183 |   1.0005 |
join_dataframe_index_single_key_small        |  16.6840 |  16.6461 |   1.0023 |
merge_2intkey_sort                           |  61.2427 |  60.5460 |   1.0115 |
join_non_unique_equal                        |   0.9513 |   0.9391 |   1.0130 |
left_outer_join_index                        | 2887.7623 | 2839.2557 |   1.0171 |
i8merge                                      | 1534.6023 | 1506.8540 |   1.0184 |
join_dataframe_index_single_key_bigger       |  25.3410 |  24.8287 |   1.0206 |
merge_2intkey_nosort                         |  21.6643 |  21.2137 |   1.0212 |
join_dataframe_integer_key                   |   3.0307 |   2.9414 |   1.0304 |
join_dataframe_integer_2key                  |   7.7363 |   7.4220 |   1.0423 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [f02ef89] : bug in left join on multi-index with sort=True or nulls
Base   [b62754d] : Merge pull request #9206 from robertdavidwest/9203_resubmitted_in_single_commit

9203 SQUASHED - DOCS: doc string edited pandas/core/frame.duplicated()