Issue 25743: [doc] Clarify exactly what \w matches in UNICODE mode

Created on 2015-11-27 15:50 by zwol, last changed 2022-04-11 14:58 by admin.

Messages (5)
msg255463	Author: Zack Weinberg (zwol) *	Date: 2015-11-27 15:50
The `re` module documentation does not do a good job of explaining exactly what `\w` matches. Quoting https://docs.python.org/3.5/library/re.html :

> \w
> For Unicode (str) patterns:
> Matches Unicode word characters; this includes most characters
> that can be part of a word in any language, as well as numbers
> and the underscore.

Empirically, this appears to mean "everything in Unicode general categories L* and N*, plus U+005F (underscore)". That is a perfectly sensible definition and the documentation should state it in those terms. "Unicode word characters" could mean any number of different things; note for instance that UTS#18 gives a very different definition.

(Further reading: https://gist.github.com/zackw/3077f387591376c7bf67 plus links therefrom).
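One way to check the empirical claim above is to compare `\w` against the general-category rule one code point at a time. The following is a minimal sketch for Python 3; the helper names `in_l_or_n_or_underscore` and `find_disagreements` are made up for illustration and are not part of `re` or `unicodedata`. The outcome depends on the Unicode database bundled with the interpreter, so the sketch reports any disagreements rather than assuming there are none.

```python
import re
import sys
import unicodedata

# A str pattern, so Unicode matching is the default in Python 3.
WORD = re.compile(r"\w")


def in_l_or_n_or_underscore(ch):
    """Hypothetical helper: the rule proposed in this issue."""
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "N")


def find_disagreements(limit=sys.maxunicode):
    """Return the characters up to `limit` where the regex and the rule differ."""
    return [
        chr(cp)
        for cp in range(limit + 1)
        if bool(WORD.fullmatch(chr(cp))) != in_l_or_n_or_underscore(chr(cp))
    ]


if __name__ == "__main__":
    mismatches = find_disagreements()
    print(len(mismatches), "code points disagree with the L*/N*/underscore rule")
    # Show a few of them, if any, with their category and name.
    for ch in mismatches[:20]:
        print("U+%04X" % ord(ch), unicodedata.category(ch),
              unicodedata.name(ch, "<unnamed>"))
```

A combining mark such as U+0301 is in category Mn, so it falls outside the rule, and `\w` does not match it either. UTS#18's compatibility definition of a word character, by contrast, includes marks, which is one concrete way the two definitions differ.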
msg255464	Author: Andi McClure (Andi McClure)	Date: 2015-11-27 16:14
I would also like to request that a clear explanation be given for the documentation in the 2.7 branch. From https://docs.python.org/2.7/library/re.html :

"\w ... If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database"

This is ambiguous. Does it mean the "Alphabetic" property from UAX#44? Does it mean something else?
msg255465	Author: Zack Weinberg (zwol) *	Date: 2015-11-27 16:40
FWIW, the actual behavior of \w matching "everything in Unicode general categories L* and N*, plus U+005F (underscore)" is consistent across all versions I can conveniently test (2.7, 3.4, 3.5). In 2.7, there are four characters in general category Nl that \w doesn't match, but I believe that is just a bug, not an intentional difference of behavior.
msg407440	Author: Irit Katriel (iritkatriel) *	Date: 2021-12-01 11:03
It's too late for the 2.7 docs, but the current docs can still be updated.
msg414180	Author: Stanley (slateny) *	Date: 2022-02-28 07:39
Would a change like this be accurate?

Matches Unicode word characters; this includes most alphanumeric characters as well as the underscore. In Unicode, alphanumeric characters are defined to be the general categories L + N (see https://unicode.org/reports/tr44/#General_Category_Values). If the :const:`ASCII` flag is used, only ``[a-zA-Z0-9_]`` is matched.
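To make the last sentence of the proposed wording concrete, here is a small sketch for Python 3 that runs `\w` over a handful of characters with and without `re.ASCII`. The sample list is arbitrary and chosen only for illustration.

```python
import re

# Arbitrary sample: ASCII letters, a digit, the underscore, two non-ASCII
# letters, an Arabic-Indic digit (U+0663), and a hyphen.
samples = ["a", "Z", "7", "_", "é", "λ", "٣", "-"]

# Default for str patterns: Unicode matching.
unicode_word = [s for s in samples if re.fullmatch(r"\w", s)]

# With re.ASCII, \w is restricted to [a-zA-Z0-9_].
ascii_word = [s for s in samples if re.fullmatch(r"\w", s, re.ASCII)]

print("Unicode:", unicode_word)
print("ASCII:  ", ascii_word)
```

The hyphen is rejected in both modes, while the non-ASCII letters and the Arabic-Indic digit are accepted only when `re.ASCII` is not set.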

History
Date	User	Action	Args
2022-04-11 14:58:24	admin	set	github: 69929
2022-02-28 07:39:46	slateny	set	nosy: + slateny; messages: +
2021-12-01 11:03:39	iritkatriel	set	nosy: + iritkatriel; versions: + Python 3.9, Python 3.10, Python 3.11, - Python 2.7, Python 3.5, Python 3.6; messages: +; keywords: + easy; title: Clarify exactly what \w matches in UNICODE mode -> [doc] Clarify exactly what \w matches in UNICODE mode
2016-01-04 03:52:01	ezio.melotti	set	versions: - Python 3.2, Python 3.3, Python 3.4; nosy: + ezio.melotti, mrabarnett; components: + Regular Expressions; type: enhancement; stage: needs patch
2015-11-27 16:40:30	zwol	set	messages: +
2015-11-27 16:14:25	Andi McClure	set	nosy: + Andi McClure; messages: +
2015-11-27 15:50:58	zwol	create