Issue 10703: Regex 0.1.20101210 - Python tracker

The regex package doesn't seem to correctly implement the single grapheme match "\X" (\P{M}\p{M}*) for pre-Python 3. I'm using the string "íi-te" (i, U+0301, i, -, t, e -- where U+0301 is Unicode COMBINING ACUTE ACCENT), reading it in from a file to bypass Unicode copy-and-paste issues in the older IDLEs.
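For reference, the expected grouping under the simplified definition quoted above (\P{M}\p{M}*, one non-mark character followed by any number of combining marks) can be sketched with the standard library alone. This is only an illustration of that definition, not how the regex module implements "\X"; the function name `simple_graphemes` is made up for this sketch, and it makes the assumption that a mark with no preceding base character starts its own cluster.

```python
import unicodedata


def simple_graphemes(text):
    """Group each non-mark character with the combining marks that follow it.

    Hypothetical helper: approximates \\P{M}\\p{M}* and nothing more.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me all start with "M" (marks).
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            # Attach the mark to the cluster started by the previous base.
            clusters[-1] += ch
        else:
            # A base character (or a leading stray mark) opens a new cluster.
            clusters.append(ch)
    return clusters


sample = "i\u0301i-te"  # i, COMBINING ACUTE ACCENT, i, -, t, e
print(simple_graphemes(sample))
```

With the sample string this yields five clusters, the first of which is the two code points i + U+0301, which is the grouping the report expects from "\X".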

stiv@x$ python3.1
Python 3.1.2 (r312:79147, May 19 2010, 11:50:28)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import regex
>>> file = open("test_data", "rt", encoding="utf-8")
>>> g = regex.compile("\X")
>>> s = file.readline()
>>> print (s)
íi-te
>>> print (g.findall(s))
['í', 'i', '-', 't', 'e']

stiv@x$ python2.7
Python 2.7 (r27:82500, Oct 4 2010, 14:49:53)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import codecs
>>> import regex
>>> file = codecs.open("test_data", "r", "utf-8")
>>> g = regex.compile("\X")
>>> s = file.readline()
>>> s
u'i\u0301i-te'
>>> print s.encode("utf-8")
íi-te
>>> print g.findall(s)
[u'i', u'\u0301', u'i', u'-', u't', u'e']

*Not correct -- accent is treated as a separate character.

Thanks.

The regex module is intended to replace the re module, so its default behaviour is the same: in Python 2, regexes default to matching ASCII, and in Python 3, they default to matching Unicode.

If you want to use a regex on a Unicode string in Python 2 then you need to set the Unicode flag, either by providing the UNICODE flag or by putting "(?u)" in the regex itself.
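The ASCII-versus-Unicode distinction can be illustrated with the standard `re` module on Python 3. This is a sketch under stated assumptions: `re` has no "\X", so `\w+` is used instead, and because Python 3 defaults to Unicode matching, `re.ASCII` stands in here for the Python 2 default described above. The sample word is arbitrary.

```python
import re

word = "na\u00efve"  # "naïve" with a precomposed i-diaeresis (U+00EF)

# ASCII matching (standing in for the Python 2 default):
# \w does not match the non-ASCII letter, so the word is split.
ascii_pattern = re.compile(r"\w+", re.ASCII)
print(ascii_pattern.findall(word))

# Unicode matching requested with the UNICODE flag.
flag_pattern = re.compile(r"\w+", re.UNICODE)
print(flag_pattern.findall(word))

# Unicode matching requested inline with "(?u)" in the pattern itself.
inline_pattern = re.compile(r"(?u)\w+")
print(inline_pattern.findall(word))
```

The two ways of requesting Unicode matching give the same result; in the Python 2 session from the report, the equivalent change would be compiling "\X" with the UNICODE flag or writing the pattern as "(?u)\X".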

(Forehead slap.)

(In reply to the comment above, which Matthew Barnett <python@mrabarnett.plus.com> added on Tue, 14 Dec 2010.)