msg98786 - (view) |
Author: Andrew Clegg (andrewclegg) * |
Date: 2010-02-03 15:02 |
The fnmatch module has a cache of translation between glob patterns and compiled regular expressions. However this cache is never emptied; only added to. I am writing a python program which as part of its execution checks millions of unique globs - this causes the fnmatch cache to grow enormous. Attached is a patch to limit the size of the fnmatch cache to 100 (chosen to be the same size as the re module). |
|
|
msg99385 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2010-02-16 05:02 |
Hello The feature seems useful to me and the patch straightforward. I wonder why you used a global name for the max length (“_MAXCACHE”) instead of inlining the value where it is used. Does it feel more constanty? ;) I fear that the leading underscore won’t discourage users from touching it, so I’d suggest making it public or completely hidden. On second thought, you could argue that Python is a matter of consenting adults, and that leading underscore is okay. Regards |
|
|
msg99392 - (view) |
Author: Andrew Clegg (andrewclegg) * |
Date: 2010-02-16 09:38 |
Hi, I used a global name for a couple of reasons: it is consistent with the cache itself and the size of the cache being defined in the same place; and because I wanted to allow the value to be modified by users. I used the leading underscore to give a weak indication of internal use - for most users, it can be completely ignored. If someone wanted to modify the maximum size of the cache however, this would be a reasonable and safe thing for them to do. Cheers |
|
|
msg99429 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2010-02-16 18:04 |
Hi Your rationale makes perfect sense. Perhaps add a comment suggesting it’s ok to change the value. Cheers |
|
|
msg99430 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2010-02-16 18:10 |
I haven't reviewed the patch, but if the intent is that the user can change the global, then it should not start with an _, and it should be a documented part of the API. If you want to suggest that users should not normally change it, then you can do that in the docs. If you give something an _, then it is not considered part of the public API and it (the internal API, not the value) is subject to change, which means you should *not* suggest that users change it. If they find it and want to change it anyway, that's their lookout. That's the consenting adults part :) |
|
|
msg99464 - (view) |
Author: Andrew Clegg (andrewclegg) * |
Date: 2010-02-17 10:08 |
"If you give something an _, then it is not considered part of the public API and it (the internal API, not the value) is subject to change, which means you should *not* suggest that users change it. If they find it and want to change it anyway, that's their lookout. That's the consenting adults part :)" This sums it up well - it should be considered a detail of implementation, thus the _. When I said I wanted to allow the value to be modified, I was thinking of the 'consenting adult' type of usage, not everyday usage. Of course there is no reasonable way to prevent the value being modified anyway. So - _MAXCACHE can stay where it is, but should not be documented or be included in __all__. Does that seem reasonable? |
|
|
msg109532 - (view) |
Author: Andrew Clegg (andrewclegg) * |
Date: 2010-07-08 10:47 |
Hi, This bug seems to have stalled - how can I get it moved forward? There don't seem to be any objections with the patch as-is, and the problem seems clear. Cheers |
|
|
msg109543 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2010-07-08 13:53 |
OK, I've looked at the patch now. (1) it needs unit tests. (2) a py3k port would be helpful, since the code is very different in py3k, and we are now applying patches first to the py3k branch and then backporting. I think this can be considered a bug (the ever-expanding cache), so I'm OK with applying this in 2.7.1 once it has unit tests. Note that using _MAXCACHE (leading underscore) and not documenting it follows the 're' model, so that is the best choice. |
|
|
msg109545 - (view) |
Author: Andrew Clegg (andrewclegg) * |
Date: 2010-07-08 14:50 |
Attached is the Py3K version of the patch, and a unit test (Py3K only). |
|
|
msg109547 - (view) |
Author: Andrew Clegg (andrewclegg) * |
Date: 2010-07-08 15:04 |
Sorry, messed up indentation on last patch, trying again... |
|
|
msg109549 - (view) |
Author: Andrew Clegg (andrewclegg) * |
Date: 2010-07-08 15:08 |
Attached is a patch and unit test against release27-maint. |
|
|
msg109552 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2010-07-08 16:14 |
Thanks for working on this. I tried the py3k patch but it doesn't apply cleanly. Patch says "patch: **** malformed patch at line 57:" followed by an apparently blank line. Also, could you please generate the patches from the top level of the checkout? It makes it easier to apply patches if they all have the same starting point. |
|
|
msg109704 - (view) |
Author: Andrew Clegg (andrewclegg) * |
Date: 2010-07-09 09:57 |
OK, regenerated both patches from the top level of the checkout (paths within the patch are 'Lib/fnmatch.py' and 'Lib/test/test_fnmatch.py'. Sorry about the corrupted patch before - I've tested this version by reverting my changes and reapplying it, and it seems to work cleanly. Cheers |
|
|
msg109708 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2010-07-09 10:41 |
Instead of writing `bytes('foo', encoding='ascii')`, you can just write `b'foo'`. Otherwise, patch looks ok. |
|
|
msg109710 - (view) |
Author: Andrew Clegg (andrewclegg) * |
Date: 2010-07-09 10:55 |
Updated py3k patch to use the more concise bytes syntax, cheers. |
|
|
msg109734 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2010-07-09 13:37 |
Committed to py3k in r82730, 3.1 in r82733, 2.7 in r82732, and, since 2.6.6 is still to come, 2.6 in r82740. Andrew: FYI I changed the assertTrue to assertLessEqual to take advantage of the better failure messages provided by the newer unittest methods. Thanks for the patch. |
|
|
msg113358 - (view) |
Author: Raymond Hettinger (rhettinger) *  |
Date: 2010-08-09 02:07 |
For Py3.2, applying the new functools.lru_cache(). See r83871. |
|
|
msg139454 - (view) |
Author: Tom Rini (Tom.Rini) |
Date: 2011-06-29 22:59 |
Did a change later make this user-configurable? I've got some code here that now runs so slow as to be unusable because nothing is ever cached anymore. Thanks! |
|
|
msg139486 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2011-06-30 15:07 |
Well, it's unnoficially configurable pre-3.2 (set _MAXCACHE on the module). The lru_cache implementation doesn't have such an undocumented way of tweaking the limit. In neither case, however, is it true that "nothing is ever cached". You are saying that you are dealing with more than 100 patterns in a single application? (256 if you are using 3.2, and in that case it is an LRU cache and so should perform well unless your more-than-256 patterns are used in a cyclical fashion.) |
|
|
msg139494 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2011-06-30 15:38 |
Grr. "it is *not* true that nothing is ever cached". |
|
|
msg139623 - (view) |
Author: Tom Rini (Tom.Rini) |
Date: 2011-07-02 03:17 |
So yes, we have some code that was doing, roughly: if any(fnmatchcase(key, pat) for pat in pattern): refs.add(key) on 200-300 elements, a lot. That said, many of them were not globs so we've worked around this to only use fnmatchcase on globs and just key in set otherwise. Things are now usable on python 2.6.6+ (so Ubuntu 10.10) and noise otherwise. |
|
|