Issue 11947: re.IGNORECASE does not match literal "_" (underscore)

Regular expressions that are written to match literal underscores ("_", ASCII ordinal 95) and specify re.IGNORECASE during compilation do not consistently match underscores: some occurrences are matched, but others are not.

The following session log shows the problem:

Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> subject = "[Conclave-Mendoi]_ef_-_a_tale_of_memories_00-12_H264"
>>> print subject.encode("base64")  # In case my environment encoding is to blame
W0NvbmNsYXZlLU1lbmRvaV1fZWZfLV9hX3RhbGVfb2ZfbWVtb3JpZXNfMDAtMTJfSDI2NA==

>>> re.sub("_", "X", subject)  # No flags, does what I expect
'[Conclave-Mendoi]XefX-XaXtaleXofXmemoriesX00-12XH264'
>>> 
>>> re.sub("_", "X", subject, re.IGNORECASE)  # Misses some matches
'[Conclave-Mendoi]XefX-_a_tale_of_memories_00-12_H264'
>>> 
>>> re.sub("_", "X", subject, re.IGNORECASE | re.LOCALE)  # Misses fewer matches
'[Conclave-Mendoi]XefX-XaXtaleXofXmemories_00-12_H264'
>>> 
>>> re.sub("_", "X", subject, re.IGNORECASE | re.LOCALE | re.UNICODE)  # Works OK
'[Conclave-Mendoi]XefX-XaXtaleXofXmemoriesX00-12XH264'
>>> 
>>> re.sub("_", "X", subject, re.IGNORECASE | re.UNICODE) # Works OK
'[Conclave-Mendoi]XefX-XaXtaleXofXmemoriesX00-12XH264'
>>> 
>>> type(subject)  # Don't think this is a unicode string
<type 'str'>
>>>

Since my subject variable is of type str and contains only ASCII characters, I do not believe that the re.UNICODE flag should be required.

help(re.sub) says:

sub(pattern, repl, string, count=0)

and re.IGNORECASE has a value of 2.

Therefore this:

re.sub("_", "X", subject, re.IGNORECASE)

passes re.IGNORECASE in the position of the count argument, so it is telling re.sub to replace at most 2 occurrences of "_".
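
The same reading accounts for the other results in the session: each flag combination is just an integer that ends up being used as count. Here is a minimal sketch of that, along with two ways to pass the flag so that it is treated as a flag. It assumes a version where re.sub accepts a flags keyword (2.7 and later, as far as I know; the 2.6 signature quoted above does not have it), and the variable names are only for illustration. The compiled-pattern form should work on 2.6 as well.

```python
import re

subject = "[Conclave-Mendoi]_ef_-_a_tale_of_memories_00-12_H264"

# Flags are plain integers, so a flag placed in the count position
# is read as a replacement limit. Reproduce that explicitly with count=.
combos = [
    re.IGNORECASE,               # 2
    re.IGNORECASE | re.LOCALE,   # 2 | 4 = 6
    re.IGNORECASE | re.UNICODE,  # 2 | 32 = 34, more than the 8 underscores
]
by_limit = {}
for flags in combos:
    limit = int(flags)
    by_limit[limit] = re.sub("_", "X", subject, count=limit)
    print("%d %s" % (limit, by_limit[limit]))

# Option 1: compile the pattern with the flag, then call sub on it.
pattern = re.compile("_", re.IGNORECASE)
compiled_result = pattern.sub("X", subject)

# Option 2 (assumes re.sub has a flags parameter): pass it by keyword.
keyword_result = re.sub("_", "X", subject, flags=re.IGNORECASE)

print(compiled_result)
print(keyword_result)
```

Passing flags by keyword, or compiling the pattern first, keeps the flag from being mistaken for count whichever signature the installed version has.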