Issue 1243192: Incorrect documentation of re.UNICODE

The effects of the re.UNICODE flag are incorrectly documented in the library reference. Currently it says (Section 4.2.3):

U UNICODE Make \w, \W, \b, and \B dependent on the Unicode character properties database. New in version 2.0.

In fact, this flag also affects \d, \D, \s, and \S, and has done so since at least Python 2.1 (I checked 2.1.3 on Linux and 2.2.3, 2.3.5, and 2.4 on OS X; the source of _sre.c confirms it). Proof:

Python 2.4 (#1, Feb 13 2005, 18:29:12)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1666)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> not re.match(r"\d", u"\u0966")
True
>>> re.match(r"\d", u"\u0966", re.UNICODE)
<_sre.SRE_Match object at 0x36ee20>
>>> not re.match(r"\s", u"\u2001")
True
>>> re.match(r"\s", u"\u2001", re.UNICODE)
<_sre.SRE_Match object at 0x36ee20>

\u0966 is DEVANAGARI DIGIT ZERO, and \u2001 is EM QUAD, an em-wide space character.
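The same check can be sketched for Python 3, assuming (unlike the 2.x session in this report) that str patterns are Unicode-aware by default and that re.ASCII restores the ASCII-only behaviour. The helper name `matches` and the constant names are made up for this illustration and are not part of the re module.

```python
import re
import unicodedata

# The two characters used in the report.
DEVANAGARI_ZERO = "\u0966"
EM_QUAD = "\u2001"


def matches(pattern, text, flags=0):
    """Return True if pattern matches at the start of text (hypothetical helper)."""
    return re.match(pattern, text, flags) is not None


# Unicode-aware matching: the Python 3 default for str patterns,
# roughly what re.UNICODE selected in Python 2.
unicode_digit = matches(r"\d", DEVANAGARI_ZERO)
unicode_space = matches(r"\s", EM_QUAD)

# ASCII-only matching: roughly the Python 2 behaviour without re.UNICODE.
ascii_digit = matches(r"\d", DEVANAGARI_ZERO, re.ASCII)
ascii_space = matches(r"\s", EM_QUAD, re.ASCII)

print(unicodedata.name(DEVANAGARI_ZERO), unicode_digit, ascii_digit)
print(unicodedata.name(EM_QUAD), unicode_space, ascii_space)
```

If the sketch behaves as expected, both characters match only in the Unicode-aware case, which mirrors the effect of passing re.UNICODE in the session above.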

I propose to change the docs to:

U UNICODE Make \w, \W, \b, \B, \d, \D, \s, and \S dependent on the Unicode character properties database. New in version 2.0.

The documentation of \d, \D, \s, and \S in section 4.2.1 of the library reference should probably be updated accordingly as well.