Message 94068 - Python tracker (original) (raw)
Anyway, there are ways to speedup regexps, even without instructing the regexps with anti-backtracking syntaxes.
See http://swtch.com/~rsc/regexp/regexp1.html (article dated January 2007) Which discusses how Perl, PCRE (and PHP), Python, Java, Ruby, .NET library... are slow because they are using backtracking a single state in the NFA, instead of using simultaneously active states (which correctly emulates the DFA, without having to actually build the DFA transition states table, which could grow combinatorially, as seen in yacc and Bison).
Java uses now the Thomson approach in its latest releases, but I wonder how Python works: does it use the DFA simulation? Do you use PCRE?
Note that I've been using the DFA simulation since more than 20 years in 1987, when I built my first regexp matcher (because the existing implementation at that time were really damn slow), after I read the Aho/Sethi/Ullman book which already demonstrated the algorithm.
This algorithm has been implemented in some tools replacing the old yacc/Bison tools, because they generate huge DFA transition tables (and this was the reason why Lex and Yacc were maintained as separate tools, Lex using the single-state NFA approach with excessive backtracking, and Yacc/Bison trying to generate the full DFA transition tables.) : the first language to use this approach was the Purdue Univerity Compiler Construction Tools Set (PCCTS) which was initially written in C and is now fully written and supported in Java.
The Perl 6 extension for quantified capturing groups will have a slow adoption, as long as Perl will continue the slow single-state NFA approach with excessive backtracking, instead of the Aho/Sethi/Ullman (ASU) approach (that some are attributing to Thomson due to the article in 2007, but this is false) using simultaneously active states. But anyway, it already exists (and Perl developers are already working on rewriting the engine using the ASU approach).
But my suggstion is much more general, as it should not just apply to quantified capturing groups, but to any capturing group that is part of a subexpression which is quantified.
And the way I specified it, it does not depend on the way the engine is written (whever it uses a single-state NFA or multi-state NFA or generates and uses a DFA transition state with single-state like in Yacc/Bison), because capturing groups are just used to store position pairs, and regexp engines already have to count them for effectively matching the greedy or non-greedy quantifiers, so this immediately provides a usable index for storing at successive positions in a numbered array for captured groups.
The simple test case is effectively to try to match /(aa?)*a+/ with strings longer than a few dozens of 'a'.