Issue 4622: SequenceMatcher bug with long sequences (original) (raw)

Issue4622

Created on 2008-12-10 17:20 by eli.bendersky, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (5)
msg77559 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2008-12-10 17:20
Here's a reproduction of the error: Python 2.5.2 (r252:60911, Oct 20 2008, 09:11:31) [GCC 3.4.6 20060404 (Red Hat 3.4.6-10)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import difflib >>> >>> difflib.SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio() 0.0 ratio() should be returning close to 1.0 here, not 0. This is only a problem for sequences longer than 200. The analogous run for 100: >>> difflib.SequenceMatcher(None, [4] + [5] * 100, [5] * 100).ratio() 0.99502487562189057 >>> I've managed to reproduce it on Linux, Windows (AS 2.5.2) and Try Python (http://try-python.mired.org/)
msg77561 - (view) Author: David W. Lambert (LambertDW) Date: 2008-12-10 18:02
Python 3.0rc1+ similar.
msg77565 - (view) Author: Gabriel Genellina (ggenellina) Date: 2008-12-10 19:38
Python 2.3.4 and later have this bug. But release 2.1.3 doesn't: Python 2.1.3 (#35, Apr 8 2002, 17:47:50) [MSC 32 bit (Intel)] on win32 Type "copyright", "credits" or "license" for more information. >>> import difflib >>> difflib.SequenceMatcher(None, [4] + [5] * 500, [5] * 500).ratio() 0.99900099900099903 >>> difflib.SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio() 0.99750623441396513 >>> difflib.SequenceMatcher(None, [4] + [5] * 100, [5] * 100).ratio() 0.99502487562189057 I don't have any 2.2 release to test right now.
msg77566 - (view) Author: Gabriel Genellina (ggenellina) Date: 2008-12-10 19:42
#2986 may be a duplicate of this; #1528074 is relevant too.
msg108635 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-06-25 21:55
This appears to be one of at least three duplicate issues: #1528074, #2986, and #4622. I am closing two, leaving 2986 open, and merging the nearly disjoint nosy lists. (If no longer interested, you can delete yourself from 2986.) #1711800 appears to be slightly different (if not, it could be closed also.) Whether or not a new feature is ever added (earliest, now, 3.2), it appears that the docs need improvement to at least explain the current behavior. If someone who understands the issue could open a separate doc issue (for 2.6/7/3.1/2) with a suggested addition, that would be great.
History
Date User Action Args
2022-04-11 14:56:42 admin set github: 48872
2010-06-25 21:55:55 terry.reedy set status: open -> closednosy: + terry.reedymessages: + superseder: difflib.SequenceMatcher not matching long sequencesresolution: duplicate
2008-12-10 19:42:21 ggenellina set messages: +
2008-12-10 19:38:27 ggenellina set nosy: + ggenellinamessages: +
2008-12-10 18:02:57 LambertDW set nosy: + LambertDWmessages: + versions: + Python 3.0
2008-12-10 17:20:54 eli.bendersky create