Issue 24896: It is undocumented that re.UNICODE and re.LOCALE affect re.IGNORECASE

A non-ASCII string does not match a regular expression case-insensitively unless the UNICODE flag is set. This seems reasonable, but the documentation seems to imply that this is not the case.

The example:

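import re
# Python 2.7: does not match, because case folding is ASCII-only without re.UNICODE
re.compile(u"неоднозначность", re.IGNORECASE) \
        .findall(u"Неоднозначность")
# Python 2.7: matches once re.UNICODE is added
re.compile(u"неоднозначность", re.IGNORECASE | re.UNICODE) \
        .findall(u"Неоднозначность")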

(In Python 3, it does not match if re.ASCII is given.)
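
For Python 3, a minimal sketch of the same check might look like the following. It assumes str patterns (which use Unicode matching by default, so neither the u prefix nor re.UNICODE is needed); the variable names are only illustrative.

```python
import re

pattern = "неоднозначность"
text = "Неоднозначность"

# Default for str patterns: Unicode matching, so case folding covers Cyrillic.
unicode_hits = re.compile(pattern, re.IGNORECASE).findall(text)

# With re.ASCII, case-insensitive matching is limited to ASCII letters.
ascii_hits = re.compile(pattern, re.IGNORECASE | re.ASCII).findall(text)

print(unicode_hits)  # ['Неоднозначность']
print(ascii_hits)    # []
```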

The documentation (2.7) says:

re.UNICODE

Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character
properties database.

(https://docs.python.org/2/library/re.html#re.UNICODE)

My regex does not use any of those escapes, yet the regex changes behavior with the UNICODE flag. This leads to confusion when the regex doesn't match. The documentation is very specific about the behavior that changes with the flag, implying that behavior not mentioned is unaffected.

Of course, it's easy to guess the (hopefully) correct solution.

Still, I suggest changing the documentation to mention that re.IGNORECASE is affected. Looking at the source code, there seem to be further consequences (it mentions "Unicode locale") which may also warrant a mention. If you do want to avoid specifics, however, even a hand-wavy reference to something like "match according to Unicode" would help, because it implies that the escapes are not the only thing that changes behavior.

In Python 3, there is a counterpart to the 2.7 problem: re.ASCII makes our Cyrillic string not match. Again, this behavior makes intuitive sense, but the documentation seems to indicate something different:

re.ASCII
Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead
of full Unicode matching. This is only meaningful for Unicode patterns, and
is ignored for byte patterns.

…

re.IGNORECASE
Perform case-insensitive matching; expressions like [A-Z] will match
lowercase letters, too. This is not affected by the current locale and
works for Unicode characters as expected.

re.ASCII does appear to affect re.IGNORECASE. Since this is the non-default case, however, I'm not sure it's worth calling it out. I'd be happy even if only the 2.7 docs change.

Actually, the locale does affect case-insensitive matching if the re.LOCALE flag is used. The set of characters matched by b'[A-Z]' is locale-dependent. For example, in a Turkish locale it can include the letters 'İ' and 'ı'. Only 8-bit locales are supported, not UTF-8 locales.
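
One way to check this on a given system is sketched below for Python 3. The helper name bytes_matching_upper_range and the locale name "tr_TR.ISO8859-9" are assumptions for illustration; the locale may not be installed, in which case the helper returns None. The result depends on the platform's locale data, so no expected output is shown.

```python
import locale
import re


def bytes_matching_upper_range(locale_name):
    """Return the non-ASCII bytes matched by b'[A-Z]' under re.LOCALE.

    Returns None if the requested locale is not available.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
    except locale.Error:
        return None
    try:
        pattern = re.compile(b"[A-Z]", re.IGNORECASE | re.LOCALE)
        return [bytes([b]) for b in range(128, 256)
                if pattern.fullmatch(bytes([b]))]
    finally:
        # Restore the previous locale so the rest of the program is unaffected.
        locale.setlocale(locale.LC_CTYPE, saved)


# Hypothetical 8-bit Turkish locale name; adjust to what the system provides.
result = bytes_matching_upper_range("tr_TR.ISO8859-9")
print(result)
```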

In Unicode case-insensitive mode the expression '[A-Z]' matches not only the Latin uppercase and lowercase letters A-Z and a-z, but also the characters 'İ' (U+0130), 'ı' (U+0131), 'ſ' (U+017F), and 'K' (U+212A, the Kelvin sign).
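
A short Python 3 sketch of this, using escapes for the four characters so they cannot be confused with ASCII look-alikes (the variable names are only illustrative):

```python
import re

# İ, ı, ſ and the Kelvin sign, written as escapes to avoid ambiguity.
extras = ["\u0130", "\u0131", "\u017f", "\u212a"]

# Unicode case-insensitive matching (the default for str patterns).
unicode_matches = [bool(re.fullmatch("[A-Z]", ch, re.IGNORECASE))
                   for ch in extras]

# The same range with re.ASCII added.
ascii_matches = [bool(re.fullmatch("[A-Z]", ch, re.IGNORECASE | re.ASCII))
                 for ch in extras]

print(unicode_matches)  # [True, True, True, True]
print(ascii_matches)    # [False, False, False, False]
```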