[Python-Dev] Issue 2986: difflib.SequenceMatcher is partly broken

Tim Peters tim.peters at gmail.com
Tue Jul 13 23:20:24 CEST 2010


[Tim]

... BTW, it's not clear whether ratio() computes a useful value in the presence of junk, however that may be defined.

[Terry Reedy]

I agree, which is one reason why one should be able to disable auto-junking.

Yup.

There are a number of statistical methods for analyzing similarity matrices, analogous to correlation matrices, except that entries are in [0.0,1.0] rather than [-1.0,1.0]. For my Ph.D. thesis, I did such analyses for sets of species. Similarity measures should usually be symmetric and increase with greater matching. The heuristic can cause both to fail.

Two things about that. First, ratio() is here defined to be (when comparing sequence a and b):

2.0 * number_of_matches / (len(a) + len(b))

The denominator is not affected by any junk heuristics, so this ratio does indeed increase directly with "greater matching". But I don't know exactly what you mean by "greater matching" - I get the impression you think that's an inherent property of the sequences, but, as before, I don't see any meaning for it independent of the specific matching algorithm used.
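
As a concrete check (a small sketch; the strings are arbitrary), ratio() is exactly that formula applied to whatever matches the algorithm happens to find:

    from difflib import SequenceMatcher

    a, b = "abcabba", "cbabac"
    sm = SequenceMatcher(None, a, b)
    # Sum the sizes of the matching blocks the algorithm found (the final
    # sentinel block has size 0, so it doesn't affect the total).
    matches = sum(size for _, _, size in sm.get_matching_blocks())
    print(sm.ratio() == 2.0 * matches / (len(a) + len(b)))  # True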

The current implementation of quick_ratio() may be closer to what you seem to be thinking. That views both sequences as multisets, and considers number_of_matches to be the cardinality of their intersection. That measure ignores the possibility of junk - it's wholly independent of the matching algorithm - but it also ignores the order in which elements appear: it returns 1.0 whenever the second sequence is any permutation of the elements in the first, so any notion of ordering is thrown away.
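
A tiny illustration of the difference (the strings are arbitrary; the numbers follow from the definitions above):

    from difflib import SequenceMatcher

    # quick_ratio() compares multisets, so any permutation scores a perfect 1.0;
    # ratio() still depends on ordering and on which blocks the matcher picks.
    print(SequenceMatcher(None, "abcd", "dcba").quick_ratio())  # 1.0
    print(SequenceMatcher(None, "abcd", "dcba").ratio())        # 0.25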

Second, it's common as mud for ratio() to give a different result depending on the order of SequenceMatcher's arguments, junk or no junk. This is mostly because it's a good thing for SequenceMatcher to run in finite time ;-)

It's not just comparing sequences in the abstract, it's doing so in the context of producing a one-pass "left to right" edit sequence transforming the first sequence into the second, using only "insert" and "delete" operations. If the longest matching block between sequences X and Y is M, X and Y are effectively split into:

X = A + M + B
Y = C + M + D

and then the same process is recursively applied to the pairs (A, C) and (B, D). If, for example, there are many matches between blocks A and D, they'll never be seen - a one-pass edit sequence restricted to "insert" and "delete" has to view M as a wall, transforming A into C entirely before it can skip over the matching M and start thinking about how to transform B into D.
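
Here's a contrived sketch of that effect (strings made up for illustration): "1234" appears in both sequences, but on opposite sides of the wall, so it's never matched:

    from difflib import SequenceMatcher

    x = "1234" + "MMMMM" + "xy"    # A + M + B
    y = "ab"   + "MMMMM" + "1234"  # C + M + D
    sm = SequenceMatcher(None, x, y)
    # Only the "MMMMM" block (plus the zero-length sentinel) is reported;
    # the "1234" shared by A and D is never seen.
    print(sm.get_matching_blocks())
    print(sm.ratio())  # 2*5 / (11+11), roughly 0.45, despite the shared "1234"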

For that reason, when there are multiple maximal matching blocks, exactly which one is picked to be M can have percolating effects on how many additional matches are found later (BTW, it's also a reason for why a notion of junk can improve the subjective quality of results as judged by humans: if M is composed of things all of which "look interesting" to people, people tend to think the match is intuitive).

The method find_longest_match() uses to pick a winner is documented:

Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
    alo <= i <= i+k <= ahi
    blo <= j <= j+k <= bhi
and for all (i',j',k') meeting those conditions,
    k >= k'
    i <= i'
    and if i == i', j <= j'

In other words, of all maximal matching blocks, return one that
starts earliest in a, and of all those maximal matching blocks that
start earliest in a, return the one that starts earliest in b.

This has consequences. Here's a tiny example:

X = "abbb"
Y = "bbab"

All maximal matching blocks between X and Y have length 2. "bb" appears twice in X and once in Y. "ab" appears once in each. Pass X first, and M="ab" is picked as the winner. That breaks the sequences into:

X = "" + "ab" + "bb"
Y = "bb" + "ab" + ""

and no further matches are found between "" and "bb", or between "bb" and "". ratio() thus returns 0.5.
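
You can see that first pick directly (a small check of the tie-breaking rule quoted above):

    from difflib import SequenceMatcher

    sm = SequenceMatcher(None, "abbb", "bbab")
    # Of the length-2 candidates, the one starting earliest in a wins:
    # a=0, b=2, size=2, i.e. "ab" at a[0:2] == b[2:4].
    print(sm.find_longest_match(0, 4, 0, 4))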

But pass Y first, and M="bb" is picked as the winner, breaking the sequences into:

X = "a" + "bb" + "b"
Y = "" + "bb" + "ab"

and an additional match (on "b") is later found when comparing the suffixes ("b" and "ab"). ratio() thus returns 0.75 in that case.
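
The whole example fits in a few lines, if you want to reproduce the asymmetry:

    from difflib import SequenceMatcher

    x, y = "abbb", "bbab"
    print(SequenceMatcher(None, x, y).ratio())  # 0.5
    print(SequenceMatcher(None, y, x).ratio())  # 0.75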

I can conceive of trying all possible winners (and so on recursively), and then backtracking to pick a winner at each branch that maximizes the total number of matches - but "it's a good thing for SequenceMatcher to run in finite time" ;-)

There are multiple possible definitions of similarity for sets (and arguments thereabout). I am sure the same is true for sequences. But I consider the definition for .ratio, without the heuristic, to be sensible. I would consider using it should the occasion arise.

If you've ever used difflib's file-comparison functions, they use ratio() extensively under the covers. The most interesting case when producing diffs for human consumption is when two blocks have no matching lines in common. difflib's file comparators look for the two most similar lines (highest ratio()) in that case, trying to guess where (&, later, how) a human may have edited a line. This produces terrifically useful & intuitive output - when it works ;-)
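
For example, difflib.ndiff pairs up the most similar lines via ratio() and then marks its guesses about intraline edits with "?" guide lines (the lines below are made up for illustration):

    import difflib
    import sys

    old = ["the quick brown fox jumps\n", "over the lazy dog\n"]
    new = ["the quick brown fox jumped\n", "over a lazy dog\n"]
    # Each old/new pair scores well above the similarity cutoff, so ndiff
    # treats them as edited lines and emits "?" guide lines for each pair.
    sys.stdout.writelines(difflib.ndiff(old, new))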

It certainly was the intent that nothing would be called junk unless it appeared at least twice, so the "n >= 200" clause ensures that 1% of n is at least 2.

Since 2 cannot be greater than something that is at least 2, you ensured that nothing would be called junk unless it appears at least thrice.

OK, staring at the code, the minimum is actually 4 times. It's true that

if n >= 200 and len(indices) * 100 > n:

can't succeed when len(indices) is 2, but when it's 3 we've previously seen 3 instances of the element and are currently looking at the fourth.
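
A small check consistent with that reading (the helper is mine, just for illustration):

    from difflib import SequenceMatcher

    def ratio_with_x_repeated(k):
        # Compare k 'x's against a 200-element sequence holding those same
        # k 'x's plus unique integer filler (so only 'x' can become popular).
        b = ['x'] * k + list(range(200 - k))
        return SequenceMatcher(None, ['x'] * k, b).ratio()

    print(ratio_with_x_repeated(3))  # > 0.0: three occurrences are never auto-junked
    print(ratio_with_x_repeated(4))  # 0.0: on the fourth occurrence 'x' becomes junk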

...

I'd call it "autojunk", cuz that's what it would do.  Also a useful syntactic overlap with the name of the current "isjunk" argument.

I like that. I am now leaning toward the following:

G (I hope, this time, for 'go' ;-). For 2.7.1, 3.1.3, and 3.2, add 'autojunk = True' to the constructor signature. This is the minimal change that fixes the bug of there being no choice, while keeping the default as is. So it is a minimal violation of the usual stricture against API changes in bugfix releases.
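
With that signature, usage would look something like this (a sketch of the intended behavior; 'autojunk' is the proposed argument, not code that exists today):

    from difflib import SequenceMatcher

    a = b = "x" * 200
    # With the heuristic left on (the current, unavoidable behavior), 'x' is
    # auto-junked as "popular", so two identical sequences score 0.0:
    print(SequenceMatcher(None, a, b).ratio())                  # 0.0
    # With the proposed switch, the surprise goes away:
    print(SequenceMatcher(None, a, b, autojunk=False).ratio())  # 1.0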

I'm not current on the rules for bugfix releases, but way back when I'm afraid this wouldn't be allowed. The problem is that anyone using the new argument in 2.7.1 would be creating code that would blow up if run under 2.7. Seems that the 6 people who care ;-) about this case would happily agree to live with that, but I have to defer to those more current on bugfix release practice.

I would doc this as "Use an internal heuristic that identifies 'common' items as junk." and separately describe the 'current heuristic', leaving open the possibility of changing it.

Possible suboption: enforce 'autojunk in (True,False)' so the user cannot forget that it is a switch and not a tuning parameter.

Don't think that part is needed.

In 3.2, expose as an attribute a tuple '_heuristic' or 'heuristic' with the tuning parameters for the current heuristic. Adding the underscore would indicate that it is a private, expert-only, use-at-your-own-risk, subject-to-change attribute.

If we agree on this much, we can then discuss what the tuple should be for 3.2.

Think the most pressing thing is to give people a way to turn the damn thing off. An ugly way would be to trigger on an unlikely input-output behavior of the existing isjunk argument. For example, if

isjunk("what's the airspeed velocity of an unladen swallow?")

returned

"don't use auto junk!"

and 2.7.1 recognized that as meaning "don't use auto junk", code could be written under 2.7.1 that didn't blow up under 2.7. It could behave differently, although that's true of any way of disabling the auto-junk heuristics.


