BUG: bug in left join on multi-index with sort=True or nulls by behzadnouri · Pull Request #9210 · pandas-dev/pandas (original) (raw)
on master:
In [8]: left
Out[8]:
1st 2nd 3rd
0 c c 13
1 b b 79
2 a a 27
3 b b 27
4 c a 86
In [9]: right
Out[9]:
4th
1st 2nd
c a -86
b b -79
c c -13
b b -27
a a -27
sort=True
is ignored, and the result is not sorted by the join key:
In [10]: left.join(right, on=['1st', '2nd'], how='left', sort=True)
Out[10]:
1st 2nd 3rd 4th
0 c c 13 -13
1 b b 79 -79
1 b b 79 -27
2 a a 27 -27
3 b b 27 -79
3 b b 27 -27
4 c a 86 -86
in addition:
In [44]: left
Out[44]:
1st 2nd 3rd
0 NaN a 14
1 a NaN 10
2 a b 19
3 NaN NaN 62
4 a c 90
In [45]: right
Out[45]:
4th
1st 2nd
NaN a -14
a c -90
NaN -10
b -19
NaN NaN -62
this works:
In [46]: merge(left, right.reset_index(), on=['1st', '2nd'], how='left')
Out[46]:
1st 2nd 3rd 4th
0 NaN a 14 -14
1 a NaN 10 -10
2 a b 19 -19
3 NaN NaN 62 -62
4 a c 90 -90
but this does not:
In [47]: left.join(right, on=['1st', '2nd'], how='left')
Out[47]:
1st 2nd 3rd 4th
0 NaN a 14 NaN
1 a NaN 10 NaN
2 a b 19 -19
3 NaN NaN 62 NaN
4 a c 90 -90
also, get_group_index called in these lines is subject to overflow, and should be avoided.
r 'join|merge'
benchmarks:
-------------------------------------------------------------------------------
Test name | head[ms] | base[ms] | ratio |
-------------------------------------------------------------------------------
join_dataframe_index_multi | 35.4387 | 36.5883 | 0.9686 |
join_dataframe_index_single_key_bigger_sort | 24.7660 | 24.8604 | 0.9962 |
strings_join_split | 57.6473 | 57.6183 | 1.0005 |
join_dataframe_index_single_key_small | 16.6840 | 16.6461 | 1.0023 |
merge_2intkey_sort | 61.2427 | 60.5460 | 1.0115 |
join_non_unique_equal | 0.9513 | 0.9391 | 1.0130 |
left_outer_join_index | 2887.7623 | 2839.2557 | 1.0171 |
i8merge | 1534.6023 | 1506.8540 | 1.0184 |
join_dataframe_index_single_key_bigger | 25.3410 | 24.8287 | 1.0206 |
merge_2intkey_nosort | 21.6643 | 21.2137 | 1.0212 |
join_dataframe_integer_key | 3.0307 | 2.9414 | 1.0304 |
join_dataframe_integer_2key | 7.7363 | 7.4220 | 1.0423 |
-------------------------------------------------------------------------------
Test name | head[ms] | base[ms] | ratio |
-------------------------------------------------------------------------------
Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234
Target [f02ef89] : bug in left join on multi-index with sort=True or nulls
Base [b62754d] : Merge pull request #9206 from robertdavidwest/9203_resubmitted_in_single_commit
9203 SQUASHED - DOCS: doc string edited pandas/core/frame.duplicated()