regex (original) (raw)
Note
To allow testing and comparison with the current ‘re’ module, the new implementation is provided as a separate module called ‘regex’.
Flags
There are 2 kinds of flag: scoped and global. Scoped flags can apply to only part of a pattern and can be turned on or off; global flags apply to the entire pattern and can only be turned on.
The scoped flags are: IGNORECASE, MULTILINE, DOTALL, VERBOSE.
The global flags are: ASCII, LOCALE, REVERSE, UNICODE, ZEROWIDTH.
Additional features
- Atomic grouping (issue #433030)
(?>...)
If the following pattern subsequently fails, then the subpattern as a whole will fail.
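As an illustration, the sketch below compares an ordinary greedy repeat with an atomic group. It assumes the ‘regex’ module is installed and otherwise falls back to the stdlib ‘re’ module, which accepts the same (?>...) syntax only from Python 3.11 onwards. The alias re_impl and the sample text are illustrative choices, not part of the module.

```python
try:
    import regex as re_impl  # the module described here
except ImportError:
    import re as re_impl     # stdlib fallback; (?>...) needs Python 3.11+

text = "aaab"

# Greedy repeat: a+ first takes "aaa", then gives one "a" back so "ab" can match.
greedy = re_impl.search(r"a+ab", text)

# Atomic group: whatever a+ takes is kept, so the trailing "ab" can never match.
atomic = re_impl.search(r"(?>a+)ab", text)

print(greedy.group())  # aaab
print(atomic)          # None
```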
- Possessive quantifiers.
(?:...)?+ ; (?:...)*+ ; (?:...)++ ; (?:...){min,max}+
The subpattern is matched up to ‘max’ times. If the following pattern subsequently fails, then all of the repeated subpatterns will fail as a whole. For example, (?:...)++ is equivalent to (?>(?:...)+).
- Scoped flags (issue #433028)
(?flags-flags:...)
The flags will apply only to the subpattern. Flags can be turned on or off.
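A minimal sketch of both directions, turning a flag on and turning it off for one subpattern. It assumes the ‘regex’ module is installed and falls back to the stdlib ‘re’ module, which accepts the same scoped-flag syntax in recent Python versions. The alias re_impl and the sample text are illustrative only.

```python
try:
    import regex as re_impl  # the module described here
except ImportError:
    import re as re_impl     # stdlib fallback with the same scoped-flag syntax

sample = "ab Ab aB AB"

# IGNORECASE is turned on for the "a" only; the "b" stays case-sensitive.
on_for_a = re_impl.findall(r"(?i:a)b", sample)

# IGNORECASE is on for the whole pattern but turned off for the "a".
off_for_a = re_impl.findall(r"(?-i:a)b", sample, flags=re_impl.IGNORECASE)

print(on_for_a)   # ['ab', 'Ab']
print(off_for_a)  # ['ab', 'aB']
```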
- Inline flags (#433024, #433027)
(?flags-flags)
The flags will apply to the end of the group or pattern. Flags can be turned on or off.
- Repeated repeats (#2537)
A regex like ((x|y+)*)* will be accepted and will work correctly, and should also complete more quickly.
- Definition of ‘word’ character (#1693050)
The definition of a ‘word’ character has been expanded for Unicode. This applies to \w, \W, \b and \B.
- Groups in lookahead and lookbehind (#814253)
Groups and group references are permitted in both lookahead and lookbehind.
- Variable-length lookbehind
A lookbehind can match a variable-length string.
- Correct handling of charset with ignore case flag (#3511)
Ranges within charsets are handled correctly when the ignore-case flag is turned on.
- Unmatched group in replacement (#1519638)
An unmatched group is treated as an empty string in a replacement template.
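For instance, in the sketch below only one of the two groups takes part in each match, and the other contributes nothing to the replacement. It assumes the ‘regex’ module is installed and falls back to the stdlib ‘re’ module, which behaves the same way in recent Python versions. The alias re_impl and the sample pattern are illustrative only.

```python
try:
    import regex as re_impl  # the module described here
except ImportError:
    import re as re_impl     # stdlib fallback with the same behaviour

# Each match fills either group 1 or group 2, never both.
result = re_impl.sub(r"(a)|(b)", r"[\1\2]", "ab")

print(result)  # [a][b]
```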
- ‘Pathological’ patterns (#1566086, #1662581, #1448325, #1721518, #1297193)
‘Pathological’ patterns should complete more quickly.
- Flags argument for regex.split, regex.sub and regex.subn (#3482)
regex.split, regex.sub and regex.subn support a ‘flags’ argument.
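A short sketch of the keyword in use with all three functions. It assumes the ‘regex’ module is installed and falls back to the stdlib ‘re’ module, whose split, sub and subn accept the same ‘flags’ keyword. The alias re_impl and the sample text are illustrative only.

```python
try:
    import regex as re_impl  # the module described here
except ImportError:
    import re as re_impl     # stdlib fallback with the same keyword

text = "aXbxc"

# The pattern "x" matches both "X" and "x" because of the flag.
parts = re_impl.split(r"x", text, flags=re_impl.IGNORECASE)
replaced = re_impl.sub(r"x", "-", text, flags=re_impl.IGNORECASE)
replaced_with_count = re_impl.subn(r"x", "-", text, flags=re_impl.IGNORECASE)

print(parts)                # ['a', 'b', 'c']
print(replaced)             # a-b-c
print(replaced_with_count)  # ('a-b-c', 2)
```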
- ‘Overlapped’ argument for regex.findall and regex.finditer
regex.findall and regex.finditer support an ‘overlapped’ flag which permits overlapped matches.
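The sketch below shows the difference on a small input. It uses overlapped=True when the ‘regex’ module is installed. As an assumption for environments without it, the fallback approximates the same result with a capturing lookahead in the stdlib ‘re’ module, which is a different technique that happens to agree for this fixed-width pattern.

```python
text = "abcd"

try:
    import regex
    normal = regex.findall(r"..", text)
    # Each new search starts one character after the start of the previous match.
    overlapped = regex.findall(r"..", text, overlapped=True)
except ImportError:
    import re
    normal = re.findall(r"..", text)
    # Approximation only: a zero-width lookahead captures two characters at each position.
    overlapped = re.findall(r"(?=(..))", text)

print(normal)      # ['ab', 'cd']
print(overlapped)  # ['ab', 'bc', 'cd']
```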
- Unicode escapes (#3665)
The Unicode escapes \uxxxx and \Uxxxxxxxx are supported.
- Large patterns (#1160)
Patterns can be much larger.
- Zero-width match with regex.finditer (#1647489)
regex.finditer behaves correctly when it splits at a zero-width match.
- Zero-width split with regex.split (#3262)
regex.split can split at a zero-width match if the zero-width flag is turned on. When the flag is turned off the current behaviour is unchanged because the BDFL thinks that some existing software might depend on it.
- Splititer
regex.splititer has been added. It’s a generator equivalent of regex.split.
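To show what a generator equivalent of split means in practice, here is a rough stdlib-only sketch. The helper split_iter is hypothetical and is not regex.splititer itself; it ignores capture groups and any maximum split count, and only illustrates that pieces are produced one at a time instead of as a complete list.

```python
import re

def split_iter(pattern, text):
    """Yield the pieces of text between successive matches, one at a time."""
    pos = 0
    for match in re.finditer(pattern, text):
        yield text[pos:match.start()]  # the piece before this separator
        pos = match.end()
    yield text[pos:]                   # whatever follows the last separator

pieces = split_iter(r",\s*", "a, b,c")
first = next(pieces)  # only the first piece has been computed so far
rest = list(pieces)   # consume the remaining pieces

print(first)  # a
print(rest)   # ['b', 'c']
```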
- Subscripting for groups
A match object supports access to the captured groups via subscripting and slicing:
m = regex.search(r"(?<before>.*?)(?<num>\d+)(?<after>.*)", "pqr123stu")
print(m["before"])
pqr
print(m["num"])
123
print(m["after"])
stu
print(len(m))
4
print(m[:])
('pqr123stu', 'pqr', '123', 'stu')
- Named groups
Groups can be named with (?<name>...) as well as the current (?P<name>...).
- Group references
Groups can be referenced within a pattern with \g<name>. This also allows there to be more than 99 groups.
- Named characters
\N{name}
Named characters are supported.
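A small sketch of a named character inside a pattern. It assumes the ‘regex’ module is installed and falls back to the stdlib ‘re’ module, which accepts \N{name} in patterns in recent Python versions. The alias re_impl and the sample text are illustrative only.

```python
try:
    import regex as re_impl  # the module described here
except ImportError:
    import re as re_impl     # stdlib fallback; \N{...} in patterns needs Python 3.8+

# "\u03b1" is GREEK SMALL LETTER ALPHA; the pattern refers to it by name.
text = "x\u03b1\u03b1y"
found = re_impl.findall(r"\N{GREEK SMALL LETTER ALPHA}+", text)

print(found == ["\u03b1\u03b1"])  # True
```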
- Unicode codepoint properties, blocks and scripts
\p{name} ; \P{name}
Unicode properties, blocks and scripts are supported. \p{name} matches a character which has property ‘name’ and \P{name} matches a character which doesn’t have property ‘name’.
In order to avoid ambiguity, block names should start with In and script names should start with Is. If a name lacks such a prefix and it could be a block or a script, script will take priority, for example:
- InBasicLatin or BasicLatin, the ‘BasicLatin’ block.
- IsLatin or Latin, the ‘Latin’ script.
- InCyrillic, the ‘Cyrillic’ block.
- IsCyrillic or Cyrillic, the ‘Cyrillic’ script.
- Posix character classes
[[:alpha:]]
Posix character classes are supported.
- Search anchor
\G
A search anchor has been added. It matches at the position where each search started/continued and can be used for contiguous matches or in negative variable-length lookbehinds to limit how far back the lookbehind goes:
regex.findall(r"\w{2}", "abcd ef")
['ab', 'cd', 'ef']
regex.findall(r"\G\w{2}", "abcd ef")
['ab', 'cd']
- The search starts at position 0 and matches 2 letters ‘ab’.
- The search continues at position 2 and matches 2 letters ‘cd’.
- The search continues at position 4 and fails to match any letters.
- The anchor stops the search start position from being advanced, so there are no more results.
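The steps above can be traced by hand with the stdlib ‘re’ module, whose compiled patterns offer match(string, pos), which only succeeds exactly at pos. The helper contiguous_findall is hypothetical and only imitates the contiguous-match use of \G; it is not how the ‘regex’ module implements the anchor.

```python
import re

def contiguous_findall(pattern, text):
    """Collect matches that each begin exactly where the previous one ended."""
    compiled = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(text):
        match = compiled.match(text, pos)  # anchored at pos, like \G
        if match is None or match.end() == pos:
            break                          # no match here (or an empty one): stop
        results.append(match.group())
        pos = match.end()                  # the next search continues from here
    return results

unanchored = re.findall(r"\w{2}", "abcd ef")
contiguous = contiguous_findall(r"\w{2}", "abcd ef")

print(unanchored)  # ['ab', 'cd', 'ef']
print(contiguous)  # ['ab', 'cd']
```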
- Reverse searching
Searches can now work backwards:
regex.findall(r".", "abc")
['a', 'b', 'c']
regex.findall(r"(?r).", "abc")
['c', 'b', 'a']
Note: the result of a reverse search is not necessarily the reverse of a forward search:
regex.findall(r"..", "abcde")
['ab', 'cd']
regex.findall(r"(?r)..", "abcde")
['de', 'bc']
- Multithreading
The regex module now releases the GIL when matching, enabling other Python threads to run concurrently.
- Matching a single grapheme
\X
The grapheme matcher is supported. It’s equivalent to \P{M}\p{M}*.
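To make the \P{M}\p{M}* equivalence concrete, the stdlib-only sketch below groups each non-mark character with the combining marks that follow it, using unicodedata to test for the Mark categories. The helper simple_graphemes is hypothetical and is not the module's \X implementation; unlike the pattern, it also emits a leading mark that has no base character as its own cluster.

```python
import unicodedata

def simple_graphemes(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")  # Mn, Mc or Me
        if is_mark and clusters:
            clusters[-1] += ch   # attach the mark to the preceding cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

# "e" followed by U+0301 COMBINING ACUTE ACCENT, then a plain "a".
clusters = simple_graphemes("e\u0301a")

print(len(clusters))  # 2
```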