Issue 1611131: \b in unicode regex gives strange results

The problem: This doesn't give a match:

re.match(r'ä\b', 'ä ', re.UNICODE)

This works ok and gives a match:

re.match(r'.\b', 'ä ', re.UNICODE)

Both of these work as well:

re.match(r'a\b', 'a ', re.UNICODE)
re.match(r'.\b', 'a ', re.UNICODE)

Docs say \b is defined as an empty string between \w and \W. These do match accordingly:

re.match(r'\w', 'ä', re.UNICODE)
re.match(r'\w', 'a', re.UNICODE)
re.match(r'\W', ' ', re.UNICODE)

So something strange happens in my first example, and I can't help but assume it's a bug.

Ok so this does work:

re.match(ur'ä\b', u'ä ', re.UNICODE)

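If I understand correctly, my earlier examples were matching UTF-8 encoded byte strings (my Ubuntu is UTF-8 by default), where 'ä' is two bytes rather than one character, and regex special operators such as \b just don't work reliably on encoded bytes. With unicode strings for both the pattern and the text, they do.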
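
For anyone who wants to see the byte-level effect, here is a rough sketch in Python 3. It is not the original Python 2 session: Python 3 refuses re.UNICODE on bytes patterns, so the sketch assumes that decoding the UTF-8 bytes as Latin-1 (one code point per byte) is a fair stand-in for how a Python 2 byte string was treated under re.UNICODE. The variable names are made up for illustration.

```python
import re

A_UMLAUT = "\u00e4"   # the character 'ä'
text = A_UMLAUT + " "

# Assumption: emulate a Python 2 byte string under re.UNICODE by turning
# each UTF-8 byte into one code point. 'ä' becomes two characters:
# U+00C3 (a letter, so \w) followed by U+00A4 (currency sign, so \W).
def as_py2_bytes(s):
    return s.encode("utf-8").decode("latin-1")

byte_text = as_py2_bytes(text)
byte_pattern = as_py2_bytes(A_UMLAUT) + r"\b"

# After the two "bytes" of 'ä', both neighbours (U+00A4 and the space)
# are non-word characters, so there is no boundary and no match.
m_bytes = re.match(byte_pattern, byte_text)

# '.' consumes only the first "byte"; the boundary between U+00C3 (\w)
# and U+00A4 (\W) then satisfies \b, so this matches.
m_dot = re.match(r".\b", byte_text)

# With real text strings, 'ä' is a single word character followed by a
# space, so \b behaves as documented.
m_text = re.match(A_UMLAUT + r"\b", text)

print(m_bytes, m_dot, m_text)
```

Under that assumption, the first result is None and the other two are match objects, which lines up with what was reported above.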