More strict rules for group numbers and names in RE

There were unintentional changes in parsing regular expressions between Python 2 and Python 3.

Group references.
In patterns and replacement strings you can refer to a group by its number using the syntax \N, where N is a 1-2 digit decimal number. The number must not start with 0, because then it is parsed as an octal escape sequence. The group number can also be used in the conditional expression (?(N)...) in patterns and in references \g<N> in replacement strings. Interestingly, in Python 3 it does not have to be a plain sequence of decimal digits. The following are allowed in the group number:
- Initial zero: \g<01>.
- Spaces around the number: \g< 1 >.
- Underscores: \g<1_2>.
- Non-decimal digits: \g<¹>.
- Non-ASCII decimal digits: \g<१>.
All this is purely an implementation artifact. After \g< we search for the nearest > and pass the substring between < and > to int(). In another implementation we could search for the longest sequence of decimal digits, and all the above examples (except maybe the first one) would be filtered out automatically.
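As a rough illustration of why int() is so permissive here, the sketch below passes a few of the strings from the list above directly to int(). It does not call the re module at all; the name `candidates` is only illustrative.

```python
# Each string is what would sit between "<" and ">" in \g<...>.
# This only exercises int(); it does not involve the re parser.
candidates = ["01", " 1 ", "1_2", "\u0967"]  # "\u0967" is DEVANAGARI DIGIT ONE

results = {}
for text in candidates:
    # int() strips surrounding whitespace, accepts single underscores
    # between digits, leading zeros, and any Unicode decimal digit.
    results[text] = int(text)

for text, value in results.items():
    print(repr(text), "->", value)
```

All four strings are accepted by int(), which is why they were accepted as group numbers too.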
Group names.
In (?P<name>...), (?P=name), (?(name)...) and \g<name> we can refer to groups by name. To avoid ambiguity there is a limitation: the name must follow the rules for identifiers. In Python 2 this means that it may contain only letters, digits and underscores and must start with a non-digit. Letters and digits are ASCII-only: [A-Za-z] and [0-9].
In Python 3 identifiers can contain non-ASCII letters and digits. That is good. But in bytes patterns and replacement strings the codes \xaa, \xb2, \xb3, \xb5, \xb9, \xba, \xc0-\xd6, \xd8-\xf6, \xf8-\xff are also allowed in the group name. They correspond to the characters ª²³µ¹ºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ after decoding.
This is an implementation artifact too. Bytes patterns and replacement strings are decoded with the Latin1 encoding for parsing. It simplifies and speeds up the code. There is no other reason why letters and digits in the range U+0080-U+00FF are allowed.
Note that in Python 3 a bytes literal can only contain printable literal characters in the ASCII range. Codes outside of this range have to be written as octal or hexadecimal escape sequences. So supporting non-ASCII letters and digits does not add to readability.
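A small sketch of the effect described above, assuming only what the text states (that bytes are decoded as Latin1 before the identifier check). It does not call the re module; the sample names and the `report` variable are made up for illustration.

```python
# Hypothetical group names as they might appear in a bytes pattern.
raw_names = [b"name_1", b"caf\xe9", b"\xc0_1"]

report = {}
for raw in raw_names:
    # Latin1 maps every byte 0x00-0xFF to the code point with the same value.
    decoded = raw.decode("latin-1")
    # Record whether the decoded text is an identifier and whether it is ASCII.
    report[raw] = (decoded.isidentifier(), decoded.isascii())

for raw, (is_ident, is_ascii) in report.items():
    print(raw, "identifier:", is_ident, "ascii:", is_ascii)
```

The non-ASCII names pass the identifier check after decoding, although in source code they can only be spelled with escape sequences.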

Since the above "features" are not intentional, are not supported by most other RE engines (except regex, which is also written in Python), are not tested, and could change as a result of refactoring the parser, I suggest introducing stricter rules for group numbers and names.

A group number should contain only ASCII decimal digits in the range [0-9]. A leading 0 is not allowed except for group number 0.
A group name in a bytes pattern or replacement string should contain only ASCII letters and digits.
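The two proposed rules could be sketched as standalone checks like the ones below. This is only an illustration, not the code from the planned PRs: both function names are hypothetical, and allowing underscores in names is an assumption carried over from the identifier rules described earlier.

```python
def is_strict_group_number(text: str) -> bool:
    """Hypothetical check: ASCII digits only, no leading zero except "0"."""
    # For ASCII-only text, str.isdigit() is true only for characters 0-9;
    # the empty string is rejected because "".isdigit() is False.
    if not (text.isascii() and text.isdigit()):
        return False
    return text == "0" or not text.startswith("0")


def is_strict_bytes_group_name(name: bytes) -> bool:
    """Hypothetical check: an ASCII-only identifier (assumption: "_" allowed)."""
    decoded = name.decode("latin-1")
    return decoded.isascii() and decoded.isidentifier()


for number in ["12", "0", "01", " 1 ", "1_2", "\u0967"]:
    print(repr(number), is_strict_group_number(number))

for name in [b"name_1", b"caf\xe9", b"1abc"]:
    print(name, is_strict_bytes_group_name(name))
```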

The question: do we need a deprecation period for this? I have written code for both options (with a deprecation and with an error) and will create PRs tomorrow.