Huge performance hit in get_indexer_non_unique · Issue #15364 · pandas-dev/pandas (original) (raw)

Let df0 and df1 be indexed data frames. Assume that at least one of them has a non-unique index. If I'm not mistaken, the function #L309 will always be called upon the operation:

Now imagine that both of indexes are reasonably large (lets say at least one millions of records). The operations at lines

#L341
#L342
#L351

will be of order O(n^2*log(n)^2) even when both indices are sorted.

Am I missing something? Let me know if the above reasoning is right so I can help with it. If both indexes are sorted, It looks like that function can be as fast as O(n) simply by looping over the values and targets arrays simultaneously. (Assuming that the index duplication is very small compared to n.