[Python-Dev] Zero-width matching in regexes
Serhiy Storchaka storchaka at gmail.com
Wed Dec 6 09:15:12 EST 2017
- Previous message (by thread): [Python-Dev] Zero-width matching in regexes
- Next message (by thread): [Python-Dev] Zero-width matching in regexes
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
06.12.17 15:37, Paul Moore wrote:
Behaviour (1) means that we'd get
>>> regex.sub(r'\w*', 'x', 'hello world', flags=regex.VERSION1)
'xx xx'
(because \w* matches the empty string after each word, as well as each word itself). I just tested in Perl, and that is indeed what happens there as well.
Yes, because in this case you need to use \w+, not \w*.
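The difference between the two quantifiers can be checked with a minimal sketch like the one below. It uses only the standard re module; the variable names are illustrative and not taken from the thread. The output of re.sub() with \w* depends on the Python version, so it is not shown here.

```python
import re

text = 'hello world'

# \w* can match the empty string, so the list of matches contains
# zero-width entries in addition to the words themselves.
star_matches = re.findall(r'\w*', text)

# \w+ requires at least one word character, so it can never match
# the empty string and no zero-width entries appear.
plus_matches = re.findall(r'\w+', text)
plus_result = re.sub(r'\w+', 'x', text)

print(star_matches)
print(plus_matches)   # ['hello', 'world']
print(plus_result)    # 'x x'
```

With \w+ the substitution does not depend on how the engine treats zero-width matches, which is why the result is the same under every behaviour discussed in this thread.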
No CPython tests will fail if re.sub() is changed to behaviour (2), except the tests just added in 3.7 and the one test written specifically to guard the old behavior. But I don't know how much third-party code would break if this change were made.
On that basis, I have to say that I find behaviour (2) more intuitive and (arguably) "correct":
>>> regex.sub(r'\w*', 'x', 'hello world', flags=regex.VERSION0)
'x x'
>>> re.sub(r'\w*', 'x', 'hello world')
'x x'
The actual behavior of re.sub(), and of regex.sub() in VERSION0 mode, was the weird behavior (4):
>>> regex.sub(r'(\b|\w+)', r'[\1]', 'hello world', flags=regex.VERSION0)
'[]h[ello] []w[orld]'
>>> regex.sub(r'(\b|\w+)', r'[\1]', 'hello world', flags=regex.VERSION1)
'[][hello][] [][world][]'
>>> re.sub(r'(\b|\w+)', r'[\1]', 'hello world')  # 3.6, behavior (4)
'[]h[ello] []w[orld]'
>>> re.sub(r'(\b|\w+)', r'[\1]', 'hello world')  # 3.7, behavior (2)
'[][hello] [][world]'
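To see which behavior a given interpreter implements, one option is to list the spans that the engine reports for the same pattern. The following is only a sketch: match_spans and zero_width_positions are hypothetical helper names introduced here, not part of re or regex, and the printed output is deliberately not shown because it varies between Python versions.

```python
import re

def match_spans(pattern, text):
    """Return (start, end, matched_text) for each match found by re.finditer."""
    return [(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, text)]

def zero_width_positions(pattern, text):
    """Return the positions at which the pattern matched the empty string."""
    return [start for start, end, _ in match_spans(pattern, text) if start == end]

text = 'hello world'
pattern = r'(\b|\w+)'

# Each reported span; entries with start == end are zero-width matches.
for span in match_spans(pattern, text):
    print(span)

print(zero_width_positions(pattern, text))

# The substitution result for the running interpreter.
print(re.sub(pattern, r'[\1]', text))
```

Comparing the printed spans with the substitution result shows whether a zero-width match next to a non-empty match is replaced or skipped.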