[Python-Dev] Some questions about maintenance of the regular expression code. (original) (raw)

Gary Herron gherron@islandtraining.com
Wed, 26 Feb 2003 00:22:15 -0800


I've decided to answer Guido's call for someone to take over maintenance of the SRE code since it has started to fall into disrepair. First a short introduction and then on with a question that begs for some discussion on this list.

My name is Gary Herron. I've been using Python whenever possible for about 8 years (and for most of the last year and a half I've been able to choose Python almost exclusively -- lucky me). I've mostly lurked around the python and python-dev lists, only occasionally offering help or comments. Volunteering to maintain the SRE code seems like a good opportunity to jump in and do something useful.

Now on with the questions at hand:

The first glance at the regular expression bug list and the _sre.c code results in the observation that several of the bugs are related to running over the recursion limit. The problem comes from using a pattern containing ".?" in a situation where it is expected to match many thousands of characters. Each character matched by ".?" causes one level or recursion, quickly overflowing the recursion limit.

Question 1: Should we even consider these as bugs?

After all the recursion limit is in place to prevent badly used re's from crashing Python with a stack overflow. We could claim the kinds of patterns which cause heavy recursion are miss-uses of regular expressions which are bound to fail when used on long strings. If we take this route, something should be added to the documentation which explains when excessive recursion is likely to bite.

Question 2: If we want to solve the problem (instead of just dodging it) how should we proceed?

Any comments? Remember that all the points under question 2 are worth considering only if we decide we really ought to support things like patterns using ".*?" to match many thousands of characters.

Thanks, Gary Herron