Issue 1611131: \b in unicode regex gives strange results
The problem: This doesn't give a match:
re.match(r'ä\b', 'ä ', re.UNICODE)
This works ok and gives a match:
re.match(r'.\b', 'ä ', re.UNICODE)
Both of these work as well:
re.match(r'a\b', 'a ', re.UNICODE)
re.match(r'.\b', 'a ', re.UNICODE)
Docs say \b is defined as an empty string between \w and \W. These do match accordingly:
re.match(r'\w', 'ä', re.UNICODE)
re.match(r'\w', 'a', re.UNICODE)
re.match(r'\W', ' ', re.UNICODE)
So something strange happens in my first example, and I can't help but assume it's a bug.
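The documented definition of \b quoted above can be checked directly. The following is a minimal sketch in Python 3 (the report itself uses Python 2), where str patterns are Unicode-aware by default. The helper names boundary_positions and boundaries_by_definition are made up for illustration and are not part of the re module.

```python
import re

def boundary_positions(text):
    """Positions where the regex engine reports a \\b match."""
    return [m.start() for m in re.finditer(r'\b', text)]

def boundaries_by_definition(text):
    """Positions where exactly one neighbouring character matches \\w.

    The start and end of the string count as non-word positions.
    """
    is_word = [re.match(r'\w', ch) is not None for ch in text]
    positions = []
    for i in range(len(text) + 1):
        before = is_word[i - 1] if i > 0 else False
        after = is_word[i] if i < len(text) else False
        if before != after:
            positions.append(i)
    return positions

# '\u00e4' is the precomposed character 'ä'.
sample = '\u00e4 '
print(boundary_positions(sample))
print(boundaries_by_definition(sample))
```

Both helpers should report the same positions for a Unicode string, which is the behaviour the first example was expected to show.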
Ok so this does work:
re.match(ur'ä\b', u'ä ', re.UNICODE)
If I understand correctly, my examples were comparing UTF-8 encoded byte strings (my Ubuntu uses UTF-8 by default), and the special sequences enabled by re.UNICODE, such as \b, only behave as expected on unicode strings, not on encoded bytes.
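A rough way to see why the encoded version fails, sketched in Python 3 rather than the Python 2 used in the report. Python 3 rejects re.UNICODE on bytes patterns, so the sketch imitates the old behaviour by decoding the UTF-8 bytes as Latin-1, which turns each byte into one character. That stand-in is an assumption about how Python 2 classified the bytes, not something taken from its source.

```python
import re

# 'ä' (U+00E4) followed by a space, as a real Unicode string.
text = '\u00e4 '
pattern = '\u00e4' + r'\b'

# Unicode pattern against Unicode text: \b sits between 'ä' and ' '.
unicode_match = re.match(pattern, text)

# Assumed stand-in for the Python 2 byte-string case: UTF-8 encodes 'ä'
# as the two bytes C3 A4; viewed one byte per character these become
# 'Ã' (a letter) and '¤' (a currency sign, not a word character).
bytewise_text = text.encode('utf-8').decode('latin-1')
bytewise_pattern = '\u00e4'.encode('utf-8').decode('latin-1') + r'\b'

# '¤' and ' ' are both non-word characters, so there is no boundary
# between them.
bytewise_match = re.match(bytewise_pattern, bytewise_text)

# '.' consumes only 'Ã', and a boundary does exist between 'Ã' and '¤'.
dot_match = re.match(r'.\b', bytewise_text)

print(unicode_match, bytewise_match, dot_match)
```

Under this assumption, the r'.\b' example matched only the first byte of the encoded character, and r'\w' matched for the same reason, which would explain why those two appeared to work.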