Issue 30349: Preparation for advanced set syntax in regular expressions
Currently the re module supports only simple sets. They can include literal characters, character ranges and some simple character classes, and they support negation. The Unicode standard [1] defines set operations (union, intersection, difference and symmetric difference) and nested sets. Some regular expression engines have implemented these features; for example, the regex module supports all TR18 features except not-nested POSIX character classes.
If we replace the re module with the regex module, or add support for these features to the re module and enable this syntax by default, some code will break. It is very unlikely that a regular expression contains duplicated characters ('--', '||', '&&' or '~~'), but nested sets use just '[', and an unescaped '[' does occur in character sets in regular expressions (even the stdlib contains several occurrences).
The proposed patch adds FutureWarnings that are emitted when a possibly breaking set construct ('--', '||', '&&', '~~' or '[') occurs in a regular expression. We need one or two releases with a warning before changing the syntax. The patch also makes re.escape() escape '&' and '~' and fixes several regular expressions in the stdlib.
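To make the proposed behaviour concrete, here is a minimal sketch that records the warnings raised while compiling a few patterns. It assumes an interpreter that already includes this change (as far as I know, Python 3.7 or later); the helper name future_warnings_for is made up for the example and is not part of the patch.

```python
import re
import warnings


def future_warnings_for(pattern):
    """Compile `pattern` and return the FutureWarning messages it triggers."""
    re.purge()  # drop cached patterns so the parser really runs again
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        re.compile(pattern)
    return [str(w.message) for w in caught
            if issubclass(w.category, FutureWarning)]


nested = future_warnings_for(r'[[]x]')         # unescaped '[' opening a set
intersection = future_warnings_for(r'[a&&b]')  # doubled '&' inside a set
escaped = future_warnings_for(r'[\[]x]')       # escaped '[': nothing ambiguous

print(nested)
print(intersection)
print(escaped)

# With the patch, re.escape() also protects '&' and '~'.
print(re.escape('a&b~c'))
```

Escaping the '[' (or the doubled character) keeps the current meaning of the pattern and silences the warning.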
Alternatively, support for the new set syntax could be enabled by a special flag.
I'm not sure that support for set operations and nested sets is necessary. It complicates the syntax of regular expressions (which is already not simple). Currently set operations can be emulated with lookarounds:
[set1||set2] -- (?:[set1]|[set2])
[set1&&set2] -- set1 or (?=[set1])[set2]
[set1--set2] -- set1 or set1 or (?=[set1])[^set2]
[set1~~set2] -- recursively expand [[set1||set2]--[set1&&set2]]
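The lookahead variants of these emulations can be tried out with the current re module. The sketch below picks set1 = [a-f] and set2 = [d-i] purely for illustration. The symmetric difference line is one possible way of expanding "union minus intersection" with a negative lookahead, which is my assumption and not something spelled out above.

```python
import re

# set1 = [a-f], set2 = [d-i]; both chosen only for illustration.
UNION = r'(?:[a-f]|[d-i])'          # [set1||set2]
INTERSECTION = r'(?=[a-f])[d-i]'    # [set1&&set2]
DIFFERENCE = r'(?=[a-f])[^d-i]'     # [set1--set2]
# [set1~~set2]: in the union, but not in the intersection (assumed expansion).
SYMMETRIC_DIFFERENCE = r'(?=[a-f]|[d-i])(?!(?=[a-f])[d-i]).'

text = 'abcdefghij'
union = re.findall(UNION, text)
intersection = re.findall(INTERSECTION, text)
difference = re.findall(DIFFERENCE, text)
symmetric_difference = re.findall(SYMMETRIC_DIFFERENCE, text)

print(''.join(union))                 # abcdefghi
print(''.join(intersection))          # def
print(''.join(difference))            # abc
print(''.join(symmetric_difference))  # abcghi
```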
[1] http://unicode.org/reports/tr18/#Subtraction_and_Intersection
It might be worth adding part of the problematic regex to the warning message. For Django's tests, I see an error like "FutureWarning: Possible nested set at position 17 return re.compile(res).match". It took some effort to track down the source.
A partial traceback is:
  File "/home/tim/code/django/django/core/management/commands/loaddata.py", line 247, in find_fixtures
    for candidate in glob.iglob(glob.escape(path) + '*'):
  File "/home/tim/code/cpython/Lib/glob.py", line 72, in _iglob
    for name in glob_in_dir(dirname, basename, dironly):
  File "/home/tim/code/cpython/Lib/glob.py", line 83, in _glob1
    return fnmatch.filter(names, pattern)
  File "/home/tim/code/cpython/Lib/fnmatch.py", line 52, in filter
    match = _compile_pattern(pat)
  File "/home/tim/code/cpython/Lib/fnmatch.py", line 46, in _compile_pattern
    return re.compile(res).match
  File "/home/tim/code/cpython/Lib/re.py", line 240, in compile
    return _compile(pattern, flags)
  File "/home/tim/code/cpython/Lib/re.py", line 292, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/home/tim/code/cpython/Lib/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/home/tim/code/cpython/Lib/sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/home/tim/code/cpython/Lib/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/home/tim/code/cpython/Lib/sre_parse.py", line 816, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/home/tim/code/cpython/Lib/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/home/tim/code/cpython/Lib/sre_parse.py", line 524, in _parse
    FutureWarning, stacklevel=nested + 6
FutureWarning: Possible nested set at position 17
As an aside, I'm not sure how to fix the warning in Django. It comes from the test added in https://github.com/django/django/commit/98df288ddaba9787e4a370f12aba51c2b9133142, where a path like 'tests/fixtures/fixtures/fixture_with[special]chars' is run through glob.escape(), which produces 'tests/fixtures/fixtures/fixture_with[[]special]chars'.
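For anyone who wants to reproduce this outside Django, a small sketch follows. It runs the same path through glob.escape() and fnmatch.translate() and records whatever warnings re.compile() raises. Whether a warning appears depends on how the interpreter's fnmatch translates the '[[]' set, so the sketch prints the result rather than assuming one.

```python
import fnmatch
import glob
import re
import warnings

path = 'tests/fixtures/fixtures/fixture_with[special]chars'
escaped = glob.escape(path)         # '[' becomes the one-character set '[[]'
regex = fnmatch.translate(escaped)  # the pattern fnmatch hands to re.compile()

re.purge()  # make sure the pattern is parsed, not taken from the cache
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    matched = re.compile(regex).match(path) is not None

print(escaped)
print(regex)
print(matched, [str(w.message) for w in caught])
```

Printing the translated regex next to the warning also shows which part of the pattern the reported position refers to.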