[Python-Dev] Some questions about maintenance of the regular expression code. (original) (raw)
Gary Herron gherron@islandtraining.com
Wed, 26 Feb 2003 00:22:15 -0800
- Previous message: [Python-Dev] test_re failing again on Mac OS X
- Next message: [Python-Dev] Some questions about maintenance of the regular expression code.
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
I've decided to answer Guido's call for someone to take over maintenance of the SRE code since it has started to fall into disrepair. First a short introduction and then on with a question that begs for some discussion on this list.
My name is Gary Herron. I've been using Python whenever possible for about 8 years (and for most of the last year and a half I've been able to choose Python almost exclusively -- lucky me). I've mostly lurked around the python and python-dev lists, only occasionally offering help or comments. Volunteering to maintain the SRE code seems like a good opportunity to jump in and do something useful.
Now on with the questions at hand:
The first glance at the regular expression bug list and the _sre.c code results in the observation that several of the bugs are related to running over the recursion limit. The problem comes from using a pattern containing ".?" in a situation where it is expected to match many thousands of characters. Each character matched by ".?" causes one level or recursion, quickly overflowing the recursion limit.
Question 1: Should we even consider these as bugs?
After all the recursion limit is in place to prevent badly used re's from crashing Python with a stack overflow. We could claim the kinds of patterns which cause heavy recursion are miss-uses of regular expressions which are bound to fail when used on long strings. If we take this route, something should be added to the documentation which explains when excessive recursion is likely to bite.
Question 2: If we want to solve the problem (instead of just dodging it) how should we proceed?
Increasing the limit beyond the current 10000 is not really an option for two reasons:
This doesn't solve the problem. One can always match on a string purposely chosen to be long enough to overflow any recursion limit.
A recent patch (browse "cvs log _sre.c" to find a reference) actually lowered the limit from 10000 to 7500 for certain 64-bit machines which apparently suffered a stack overflow before hitting 10000 recursion levels.
An attempt to replace the hard-coded upper limit with a programmed check of the stack space (see Misc/HISTORY for a reference to PyOS_CheckStack) was added and then withdrawn for version 2.0. Does anybody know the history of this? This would not really solve the problem (especially on the 64 bit machines which could not even hit 10000 levels of recursion), but it would push the recursion limit to its highest possible value rather than some arbitrary hard-coded value.
Removing the recursion by the standard method of storing state in a program managed stack and looping rather than recursing would push the storage problem from the stack into the (probably much larger) heap. I haven't looked at the code enough to judge if this is feasible, but if it is, some limit would still remain. It would, however, depend on available memory rather than stack space. And still, the documentation should warn that certain naive pattens on LONG strings could fail after wasting much time chewing through all available memory.
I notice that, unlike pattern ".?", matching to pattern "." does not recurse for each character matched. With only a few minutes of looking at the code, I can't begin to guess if it is feasible to make the former work like the later without recursing.
Any comments? Remember that all the points under question 2 are worth considering only if we decide we really ought to support things like patterns using ".*?" to match many thousands of characters.
Thanks, Gary Herron
- Previous message: [Python-Dev] test_re failing again on Mac OS X
- Next message: [Python-Dev] Some questions about maintenance of the regular expression code.
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]