Proposed patch makes the regular expression parser produce more optimal tree, mainly due to getting rid of non-capturing groups. This allows to apply an optimization that was forbidden before and makes the regular expression compiler producing more efficient code. For example following expressions are transformed in more optimal form: '(?:x|y)+' -> '[xy]+' '(?:ab)
(?:ac)' -> 'a[bc]' r'[a-z]
\d' -> r'[a-z\d]' This can speed up matching by 10-25 times. $ ./python -m timeit -s "import re; p = re.compile(r'(?:x
$ ./python -m timeit -s "import re; p = re.compile(r'[a-z]|[0-9]'); s = ' '*10000+'x'" "p.search(s)" Unpatched: 500 loops, best of 5: 732 usec per loop Patched: 1000 loops, best of 5: 279 usec per loop
The PR includes defining and using a _uniq function that is actually a no-op function (it doesn't uniquify, the first line returns the argument, so the rest is skipped). Was that supposed to be removed, or should it actually uniqify?