Issue 7846: Fnmatch cache is never cleared during usage (original) (raw)

Created on 2010-02-03 15:02 by andrewclegg, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
py27-fnmatch.patch andrewclegg,2010-07-09 09:57
py3k-fnmatch.patch andrewclegg,2010-07-09 09:57
py3k-fnmatch.patch andrewclegg,2010-07-09 10:55
Messages (21)
msg98786 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-02-03 15:02
The fnmatch module has a cache of translation between glob patterns and compiled regular expressions. However this cache is never emptied; only added to. I am writing a python program which as part of its execution checks millions of unique globs - this causes the fnmatch cache to grow enormous. Attached is a patch to limit the size of the fnmatch cache to 100 (chosen to be the same size as the re module).
msg99385 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-02-16 05:02
Hello The feature seems useful to me and the patch straightforward. I wonder why you used a global name for the max length (“_MAXCACHE”) instead of inlining the value where it is used. Does it feel more constanty? ;) I fear that the leading underscore won’t discourage users from touching it, so I’d suggest making it public or completely hidden. On second thought, you could argue that Python is a matter of consenting adults, and that leading underscore is okay. Regards
msg99392 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-02-16 09:38
Hi, I used a global name for a couple of reasons: it is consistent with the cache itself and the size of the cache being defined in the same place; and because I wanted to allow the value to be modified by users. I used the leading underscore to give a weak indication of internal use - for most users, it can be completely ignored. If someone wanted to modify the maximum size of the cache however, this would be a reasonable and safe thing for them to do. Cheers
msg99429 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-02-16 18:04
Hi Your rationale makes perfect sense. Perhaps add a comment suggesting it’s ok to change the value. Cheers
msg99430 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-02-16 18:10
I haven't reviewed the patch, but if the intent is that the user can change the global, then it should not start with an _, and it should be a documented part of the API. If you want to suggest that users should not normally change it, then you can do that in the docs. If you give something an _, then it is not considered part of the public API and it (the internal API, not the value) is subject to change, which means you should *not* suggest that users change it. If they find it and want to change it anyway, that's their lookout. That's the consenting adults part :)
msg99464 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-02-17 10:08
"If you give something an _, then it is not considered part of the public API and it (the internal API, not the value) is subject to change, which means you should *not* suggest that users change it. If they find it and want to change it anyway, that's their lookout. That's the consenting adults part :)" This sums it up well - it should be considered a detail of implementation, thus the _. When I said I wanted to allow the value to be modified, I was thinking of the 'consenting adult' type of usage, not everyday usage. Of course there is no reasonable way to prevent the value being modified anyway. So - _MAXCACHE can stay where it is, but should not be documented or be included in __all__. Does that seem reasonable?
msg109532 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-07-08 10:47
Hi, This bug seems to have stalled - how can I get it moved forward? There don't seem to be any objections with the patch as-is, and the problem seems clear. Cheers
msg109543 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-07-08 13:53
OK, I've looked at the patch now. (1) it needs unit tests. (2) a py3k port would be helpful, since the code is very different in py3k, and we are now applying patches first to the py3k branch and then backporting. I think this can be considered a bug (the ever-expanding cache), so I'm OK with applying this in 2.7.1 once it has unit tests. Note that using _MAXCACHE (leading underscore) and not documenting it follows the 're' model, so that is the best choice.
msg109545 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-07-08 14:50
Attached is the Py3K version of the patch, and a unit test (Py3K only).
msg109547 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-07-08 15:04
Sorry, messed up indentation on last patch, trying again...
msg109549 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-07-08 15:08
Attached is a patch and unit test against release27-maint.
msg109552 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-07-08 16:14
Thanks for working on this. I tried the py3k patch but it doesn't apply cleanly. Patch says "patch: **** malformed patch at line 57:" followed by an apparently blank line. Also, could you please generate the patches from the top level of the checkout? It makes it easier to apply patches if they all have the same starting point.
msg109704 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-07-09 09:57
OK, regenerated both patches from the top level of the checkout (paths within the patch are 'Lib/fnmatch.py' and 'Lib/test/test_fnmatch.py'. Sorry about the corrupted patch before - I've tested this version by reverting my changes and reapplying it, and it seems to work cleanly. Cheers
msg109708 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-07-09 10:41
Instead of writing `bytes('foo', encoding='ascii')`, you can just write `b'foo'`. Otherwise, patch looks ok.
msg109710 - (view) Author: Andrew Clegg (andrewclegg) * Date: 2010-07-09 10:55
Updated py3k patch to use the more concise bytes syntax, cheers.
msg109734 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-07-09 13:37
Committed to py3k in r82730, 3.1 in r82733, 2.7 in r82732, and, since 2.6.6 is still to come, 2.6 in r82740. Andrew: FYI I changed the assertTrue to assertLessEqual to take advantage of the better failure messages provided by the newer unittest methods. Thanks for the patch.
msg113358 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-08-09 02:07
For Py3.2, applying the new functools.lru_cache(). See r83871.
msg139454 - (view) Author: Tom Rini (Tom.Rini) Date: 2011-06-29 22:59
Did a change later make this user-configurable? I've got some code here that now runs so slow as to be unusable because nothing is ever cached anymore. Thanks!
msg139486 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-30 15:07
Well, it's unnoficially configurable pre-3.2 (set _MAXCACHE on the module). The lru_cache implementation doesn't have such an undocumented way of tweaking the limit. In neither case, however, is it true that "nothing is ever cached". You are saying that you are dealing with more than 100 patterns in a single application? (256 if you are using 3.2, and in that case it is an LRU cache and so should perform well unless your more-than-256 patterns are used in a cyclical fashion.)
msg139494 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-30 15:38
Grr. "it is *not* true that nothing is ever cached".
msg139623 - (view) Author: Tom Rini (Tom.Rini) Date: 2011-07-02 03:17
So yes, we have some code that was doing, roughly: if any(fnmatchcase(key, pat) for pat in pattern): refs.add(key) on 200-300 elements, a lot. That said, many of them were not globs so we've worked around this to only use fnmatchcase on globs and just key in set otherwise. Things are now usable on python 2.6.6+ (so Ubuntu 10.10) and noise otherwise.
History
Date User Action Args
2022-04-11 14:56:57 admin set github: 52094
2011-07-02 03:17:17 Tom.Rini set messages: +
2011-06-30 15:38:58 r.david.murray set messages: +
2011-06-30 15:07:59 r.david.murray set messages: +
2011-06-29 22:59:15 Tom.Rini set nosy: + Tom.Rinimessages: +
2010-08-09 02:07:29 rhettinger set nosy: + rhettingermessages: +
2010-07-09 13:37:55 r.david.murray set status: open -> closedversions: + Python 2.6messages: + resolution: accepted -> fixedstage: test needed -> resolved
2010-07-09 12:23:48 r.david.murray set assignee: r.david.murray
2010-07-09 10:55:12 andrewclegg set files: + py3k-fnmatch.patchmessages: +
2010-07-09 10:41:30 pitrou set nosy: + pitroumessages: +
2010-07-09 09:57:36 andrewclegg set files: + py3k-fnmatch.patch
2010-07-09 09:57:28 andrewclegg set files: + py27-fnmatch.patchmessages: +
2010-07-09 09:54:29 andrewclegg set files: - py27-fnmatch.patch
2010-07-09 09:54:26 andrewclegg set files: - py3k-fnmatch.patch
2010-07-08 16:14:47 r.david.murray set messages: +
2010-07-08 15:08:47 andrewclegg set files: - fnmatch.patch
2010-07-08 15:08:17 andrewclegg set files: + py27-fnmatch.patchmessages: +
2010-07-08 15:04:25 andrewclegg set files: + py3k-fnmatch.patchmessages: +
2010-07-08 15:04:07 andrewclegg set files: - py3k-fnmatch.patch
2010-07-08 14:50:57 andrewclegg set files: + py3k-fnmatch.patchmessages: +
2010-07-08 13:53:13 r.david.murray set resolution: acceptedmessages: + versions: - Python 2.6
2010-07-08 10:47:01 andrewclegg set messages: +
2010-02-17 10:08:40 andrewclegg set messages: +
2010-02-16 18:10:54 r.david.murray set type: behavior -> resource usagemessages: + nosy: + r.david.murray
2010-02-16 18:04:06 eric.araujo set messages: +
2010-02-16 09:38:45 andrewclegg set messages: +
2010-02-16 05:02:10 eric.araujo set nosy: + eric.araujomessages: +
2010-02-03 15:29:24 brian.curtin set priority: normalkeywords: + needs reviewstage: test neededtype: behaviorversions: + Python 2.6, Python 3.1, Python 2.7, Python 3.2, - Python 2.5
2010-02-03 15:02:13 andrewclegg create