msg227055 - (view) |
Author: Yury Selivanov (yselivanov) *  |
Date: 2014-09-18 17:39 |
While writing a lexer for the JavaScript language, I managed to hit the limit on the number of named groups in a single regexp: it's 100. The check is in the sre_compile.py:compile() function, and there is even an XXX comment on it. Unfortunately, I'm not an expert in this module, so I'm not sure whether this check can be lifted, or at least whether the number can be bumped to 200 or 500 (why is it 100, btw?). Please share your thoughts. |
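To illustrate the situation described above, here is a minimal sketch that probes whether the running interpreter accepts a pattern with many named groups. It is not code from the lexer in question: the helper name `count_groups` and the generated group names `g0`, `g1`, ... are hypothetical. On interpreters that still enforce the check, compilation is expected to fail (an `AssertionError` is caught here on that assumption); on versions without the limit the pattern simply compiles.

```python
import re


def count_groups(n):
    """Compile an alternation of n named groups.

    Returns the number of capturing groups, or None if the
    interpreter refuses to compile the pattern.
    """
    # Lexer-style pattern: (?P<g0>tok0)|(?P<g1>tok1)|...
    pattern = "|".join("(?P<g%d>tok%d)" % (i, i) for i in range(n))
    try:
        return re.compile(pattern).groups
    except (re.error, AssertionError):
        # Assumption: older interpreters reject patterns that
        # exceed the group limit with one of these exceptions.
        return None


print(count_groups(50))
print(count_groups(150))
```

A result of `None` for the second call indicates that the limit is still in place.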
|
|
msg227058 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2014-09-18 18:04 |
It is 100 to avoid a syntactic ambiguity between numbered groups and octal numbers, if I remember correctly. I can't remember whether that constraint still applies in Python 3, where the octal notation was made stricter in general. |
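The ambiguity mentioned above can be seen directly in the `re` module: a backslash followed by one or two digits is read as a group reference, while three octal digits are read as a character code. A small sketch (the sample patterns are chosen only for illustration):

```python
import re

# \1 is a backreference to group 1, so "(a)\1" matches "aa".
backref = re.fullmatch(r"(a)\1", "aa")

# Three octal digits are read as a character code, not a group
# number: \101 is octal 101 == 65 == "A".
octal = re.fullmatch(r"\101", "A")

print(backref is not None)  # True
print(octal is not None)    # True
```

Because `\101` already means a character, there is no room in this notation for a reference to group 101.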
|
|
msg227060 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2014-09-18 18:54 |
In the regex module, I borrowed the \g<...> escape from .sub's replacement string to provide an alternative way to refer to a group in a pattern, and that let me remove the limit. |
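For reference, this is how the `\g<...>` escape behaves in a replacement string passed to the standard library's `re.sub`, which is the syntax the message says was borrowed. The sketch below shows only the stdlib replacement form, not the regex module's in-pattern form; the sample strings are made up.

```python
import re

# \g<2> and \g<1> refer to groups by number in a replacement string.
swapped = re.sub(r"(\w+) (\w+)", r"\g<2> \g<1>", "hello world")

# \g<1>0 means "group 1 followed by a literal 0"; a bare \10 would
# instead be read as a reference to group 10.
suffixed = re.sub(r"(a)", r"\g<1>0", "a")

print(swapped)   # world hello
print(suffixed)  # a0
```

Since the group number is delimited by angle brackets, it can have any number of digits without colliding with octal escapes.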
|
|
msg227063 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2014-09-18 20:36 |
There are two reasons for this limitation. The first reason is the one mentioned by David: there is no syntax to backreference a group with a number > 99 (although there is such a syntax for conditional groups and for substitutions). The second reason is that the current implementation of the regexp engine uses an array of constant size for groups. Here is a patch which increases the static limit to 1000 groups. It also allows specifying an arbitrary group number in the form "(?P=number)". This is consistent with the syntax of conditional groups and of substitutions. |
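As a reminder of the conditional-group syntax referred to above, `(?(N)yes|no)` tests whether group N took part in the match, and N is written as a plain decimal number, so it is not restricted to two digits by the notation itself. A minimal sketch with a made-up pattern:

```python
import re

# Group 1 optionally matches "<"; (?(1)>) requires a closing ">"
# only if group 1 took part in the match.
bracketed = re.compile(r"(<)?\w+(?(1)>)")

print(bool(bracketed.fullmatch("<abc>")))  # True
print(bool(bracketed.fullmatch("abc")))    # True
print(bool(bracketed.fullmatch("<abc")))   # False
```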
|
|
msg227064 - (view) |
Author: Yury Selivanov (yselivanov) *  |
Date: 2014-09-18 20:53 |
Serhiy, this is awesome! Is it possible to split the patch in two, and commit the part that just increases the groups limit to 3.4 as well? Thank you. |
|
|
msg227066 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2014-09-18 21:13 |
This is definitely not a bug fix. Maybe Matthew will commit it to the regex module, and then you could use regex instead of re. |
|
|
msg227237 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2014-09-21 20:49 |
Here is a patch which removes the static limit. It is much more complicated than the first patch, and I would prefer to apply the first patch first. Aren't 1000 groups enough for everyone? |
|
|
msg227635 - (view) |
Author: Yury Selivanov (yselivanov) *  |
Date: 2014-09-26 16:51 |
I'm fine with either one, Serhiy. The static one looks good to me. |
|
|
msg227820 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2014-09-29 19:50 |
New changeset 0b85ea4bd1af by Serhiy Storchaka in branch 'default': Issue #22437: Number of capturing groups in regular expression is no longer https://hg.python.org/cpython/rev/0b85ea4bd1af |
|
|
msg227825 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2014-09-29 20:15 |
Thank you, Antoine, for your review. To avoid a discrepancy between re and regex (and other engines), I have committed only part of the dynamic patch, without adding support for backreferences with an index over 99. This limit is unlikely to be reached in a hand-written regular expression, and in a generated regular expression you can use named groups. I found that backreference syntax is one of the most inconsistent things in regular expressions. There are at least 8 different variants (\N, \gN, \g<N>, \g{N}, \k<N>, \k'N', \k{N}, (?P=N)), and \g in Perl has a different meaning. |
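The suggestion to use named groups in generated patterns can be sketched as follows, assuming an interpreter that includes the change from this issue. The pattern is hypothetical: 120 optional filler groups stand in for whatever a generator might emit, and the group name `quote` is made up. The named backreference `(?P=quote)` reaches a group whose index is beyond what `\N` can address.

```python
import re

# Hypothetical generated pattern: 120 optional filler groups, then a
# named group "quote" that ends up with index 121.
filler = "".join("(?P<f%d>#)?" % i for i in range(120))
pattern = re.compile(filler + r"(?P<quote>['\"])\w+(?P=quote)")

# The group's numeric index, looked up by name.
print(pattern.groupindex["quote"])

# The closing quote must be the same character as the opening one.
print(bool(pattern.fullmatch("'abc'")))
print(bool(pattern.fullmatch("'abc\"")))
```

Referring to the group by name means the generator never has to emit a numeric backreference at all.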
|
|