Issue 9985: difflib.SequenceMatcher has slightly buggy and undocumented caching behavior (original) (raw)

Created on 2010-09-29 14:32 by marienz, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
get_matching_blocks.diff	christoph,2010-10-01 14:06	Patch & test case fixing get_matching_blocks()	review

Messages (7)
msg117612 - (view)	Author: Marien Zwart (marienz) *	Date: 2010-09-29 14:32
SequenceMatcher caches the result of get_matching_blocks and get_opcodes. There are some problems with this: What get_matching_blocks caches is a list of tuples. The first call does not return that list: it returns map(Match._make, self.matching_blocks) (converting the tuples to namedtuples). Subsequent calls just return self.matching_blocks directly. Especially in python 3 and up this is weird, since the first call returns a map object while later calls return a list. This caching behavior is not documented, so calling code may mutate the returned list. One example of calling code is difflib itself: get_grouped_opcodes mutates the result of get_opcodes (a cached list). I am not sure if the right fix is to have get_grouped_opcodes copy before it mutates or to have get_opcodes return a copy. Snippet demonstrating both bugs: matcher = difflib.SequenceMatcher(a='aaaaaaaabc', b='aaaaaaaadc') print(list(matcher.get_matching_blocks())) # This should print the same thing, but it does not: print(list(matcher.get_matching_blocks())) print(matcher.get_opcodes()) print(list(matcher.get_grouped_opcodes())) # This should print the same thing as the previous get_opcodes() # list, but it does not: print(matcher.get_opcodes())
msg117800 - (view)	Author: Christoph Burgmer (christoph)	Date: 2010-10-01 14:06
Here's a test case and a fix for get_matching_blocks() to return the same content on subsequent calls.
msg117801 - (view)	Author: Christoph Burgmer (christoph)	Date: 2010-10-01 14:07
BTW, here's the commit that broke the behavior in the first place: http://svn.python.org/view/python/trunk/Lib/difflib.py?r1=54230&r2=59907
msg117819 - (view)	Author: Marien Zwart (marienz) *	Date: 2010-10-01 17:41
That fixes the first problem in python 2. It should do: self.matching_blocks = [Match._make(t) for t in non_adjacent] in python 3 though, or it will return an already-exhausted map object if it is called again. This leaves the problem of other code mutating the cached list (not realizing it is cached). I think it would make sense for these functions to just return a shallow copy of their cached list, but I have not thought that through much.
msg117830 - (view)	Author: Ned Deily (ned.deily) *	Date: 2010-10-01 20:09
(+nosy Raymond as committer of r59907 and removing python2.6 since this is not a security issue)
msg236538 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2015-02-24 19:59
I believe this has been fixed by changesets f02a563ad1bf and ed73c127421c in #21635.
msg350339 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-08-24 00:20
This was fixed for Python 2.7 and Python 3.4 and later.

History
Date	User	Action	Args
2022-04-11 14:57:07	admin	set	github: 54194
2019-08-24 00:20:45	rhettinger	set	status: open -> closedresolution: out of datemessages: + stage: needs patch -> resolved
2019-03-15 22:42:57	BreamoreBoy	set	nosy: - BreamoreBoy
2015-02-26 08:20:06	rhettinger	set	assignee: rhettinger
2015-02-24 20:00:00	BreamoreBoy	set	nosy: + BreamoreBoymessages: +
2010-10-01 20:09:30	ned.deily	set	versions: - Python 2.6nosy: + rhettinger, ned.deilymessages: + stage: needs patch
2010-10-01 17:41:26	marienz	set	messages: +
2010-10-01 14:07:44	christoph	set	messages: +
2010-10-01 14:06:41	christoph	set	files: + get_matching_blocks.diffnosy: + christophmessages: + keywords: + patch
2010-09-29 14:32:17	marienz	create