Issue 19329: Faster compiling of charset regexps
Here is a patch which speeds up compiling of regular expressions with big charsets.
Microbenchmark:

$ ./python -m timeit "from sre_compile import compile; r = '[%s]' % ''.join(map(chr, range(256, 2**16, 255)))" "compile(r, 0)"

Unpatched (but with fixed ): 119 msec per loop
Patched: 59.6 msec per loop
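The CLI microbenchmark above can also be reproduced programmatically; a minimal sketch using the `timeit` module (sre_compile is the stdlib module the patch touches; it still imports on modern Pythons, with a DeprecationWarning since 3.11):

```python
import timeit

# Build the same ~256-character big charset as the CLI benchmark and time
# how long sre_compile.compile takes on it.
setup = ("from sre_compile import compile; "
         "r = '[%s]' % ''.join(map(chr, range(256, 2**16, 255)))")
per_loop = timeit.timeit("compile(r, 0)", setup=setup, number=10) / 10
print(f"{per_loop * 1000:.1f} msec per loop")
```

Absolute numbers will differ by machine and Python version; only the unpatched/patched ratio is meaningful.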
Compiling regular expressions with big charsets was the main cause of the slow import of the email.message module ().
Here is a more complex patch which optimizes charset compiling. It affects small charsets too. Big charsets now support the same optimizations as small charsets, and the optimized bitmap can now be used even if the charset contains category items or non-BMP characters.
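Since the bitmap is purely an internal speedup, matching results must be unchanged for charsets that mix BMP and non-BMP ranges. A minimal sanity check (not part of the patch) through the public `re` API:

```python
import re

# BMP-only charset: a Cyrillic range, as in the benchmarks below.
cyrillic = re.compile('[\u0410-\u044f]+')
assert cyrillic.findall('abc \u041f\u0440\u0438\u0432\u0435\u0442 def') == \
    ['\u041f\u0440\u0438\u0432\u0435\u0442']  # finds the Cyrillic run

# Charset over a non-BMP range (emoticons block, U+1F600..U+1F64F).
emoji = re.compile('[\U0001F600-\U0001F64F]+')
assert emoji.findall('hi \U0001F600\U0001F601') == ['\U0001F600\U0001F601']
```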
$ ./python -m timeit "from sre_compile import compile; r = '[0-9]+'" "compile(r, 0)"
Unpatched: 1000 loops, best of 3: 457 usec per loop
Patched: 1000 loops, best of 3: 368 usec per loop

$ ./python -m timeit "from sre_compile import compile; r = '[ \t\n\r\v\f]+'" "compile(r, 0)"
Unpatched: 1000 loops, best of 3: 490 usec per loop
Patched: 1000 loops, best of 3: 413 usec per loop

$ ./python -m timeit "from sre_compile import compile; r = '[0-9A-Za-z_]+'" "compile(r, 0)"
Unpatched: 1000 loops, best of 3: 760 usec per loop
Patched: 1000 loops, best of 3: 527 usec per loop

$ ./python -m timeit "from sre_compile import compile; r = r'[^\ud800-\udfff]*'" "compile(r, 0)"
Unpatched: 100 loops, best of 3: 2.07 msec per loop
Patched: 1000 loops, best of 3: 1.44 msec per loop

$ ./python -m timeit "from sre_compile import compile; r = '[\u0410-\u042f\u0430-\u043f\u0404\u0406\u0407\u0454\u0456\u0457\u0490\u0491]+'" "compile(r, 0)"
Unpatched: 100 loops, best of 3: 8.24 msec per loop
Patched: 100 loops, best of 3: 2.13 msec per loop

$ ./python -m timeit "from sre_compile import compile; r = '[%s]' % ''.join(map(chr, range(256, 2**16, 255)))" "compile(r, 0)"
Unpatched: 10 loops, best of 3: 119 msec per loop
Patched: 10 loops, best of 3: 24.1 msec per loop