[Python-Dev] (no subject) (original) (raw)

Franklin? Lee leewangzhong+python at gmail.com
Wed Dec 27 01:55:58 EST 2017


On Tue, Dec 26, 2017 at 2:01 AM, Yogev Hendel <yogev at intsights.com> wrote:

I don't know if this is the right place to put this, but I've found the following lines of code results in an incredibly long processing time. Perhaps it might be of use to someone. import re pat = re.compile('^/?(?:\w+)/(?:[%\w-]+/?)+/?$') pat.match('/t/a-aa-aa-aaaaa-aa-aa-aa-aa-aa-aa./')

(I think the correct place is python-list. python-dev is primarily for the developers of Python itself. python-ideas is for proposing new features and changes to the language. python-list is for general discussion. Bug reports and feature requests belong in https://bugs.python.org/ (where your post could also have gone).)

The textbook regular expression algorithm (which I believe grep uses) runs in linear time with respect to the text length. The algorithm used by Perl, Java, Python, JavaScript, Ruby, and many other languages instead use a backtracking algorithm, which can run up to exponential time with respect to text length. This worst-case is in fact necessary (assuming P != NP): Perl allows (introduced?) backreferences, which are NP-hard[1]. Perl also added some other features which complicate things, but backreferences are enough.

The user-level solution is to understand how regexes are executed, and to work around it.

Here are library-level solutions for your example:

  1. Perl now has a regex optimizer, which will eliminate some redundancies. Something similar can be added to Python, at first as a third-party library.
  2. In theory, we can use the textbook algorithm when possible, and the backtracking algorithm when necessary. However, the textbook version won't necessarily be faster, and may take more time to create, so there's a tradeoff here.
  3. To go even further, I believe it's possible to use the textbook algorithm for subexpressions, while the overall expression uses backtracking, internally iterating through the matches of the textbook algorithm.

There's a series of articles by Russ Cox that try to get us back to the textbook (see [2]). He and others implemented the ideas in the C++ library RE2[3], which has Python bindings[4]. RE2 was made for and used on Google Code Search[5] (described in his articles), a (now discontinued) search engine for open-source repos which allowed regular expressions in the queries.

You can get a whiff of the limitations of the textbook algorithm by checking out RE2's syntax[6] and seeing what features aren't supported, though some features may be unsupported for different reasons (such as being redundant syntax).

(Apologies: I am making up reference syntax on-the-fly.) [1] "Perl Regular Expression Matching is NP-Hard" https://perl.plover.com/NPC/ [2] "Regular Expression Matching Can Be Simple And Fast" https://swtch.com/~rsc/regexp/regexp1.html "Regular Expression Matching: the Virtual Machine Approach" https://swtch.com/~rsc/regexp/regexp2.html "Regular Expression Matching in the Wild" https://swtch.com/~rsc/regexp/regexp3.html "Regular Expression Matching with a Trigram Index" https://swtch.com/~rsc/regexp/regexp4.html [3] RE2: https://github.com/google/re2 [4] pyre2: https://github.com/facebook/pyre2/ Also see re2 and re3 on PyPI, which intend to be a drop-in replacement. re3 is a Py3-compatible fork of re2, which last updated in 2015. [5] https://en.wikipedia.org/wiki/Google_Code_Search [6] https://github.com/google/re2/wiki/Syntax [7] Quote: "As a matter of principle, RE2 does not support constructs for which only backtracking solutions are known to exist. Thus, backreferences and look-around assertions are not supported." https://github.com/google/re2/wiki/WhyRE2



More information about the Python-Dev mailing list