[Python-Dev] \u and \U escapes in raw unicode string literals

Ron Adam rrr at ronadam.com
Fri May 11 09:59:42 CEST 2007


Martin v. Löwis wrote:

> This is what prompted my question, actually: in Py3k, in the str/unicode unification branch, r"\u1234" changes meaning: before the unification, this was an 8-bit string, where the \u was not special, but now it is a unicode string, where \u is special.
>
> That is true for non-raw strings also: the meaning of "\u1234" also changes.
>
> However, traditionally, there was no escaping mechanism in raw strings in Python, and I feel that this is a good principle, because it is easy to learn (if you leave out the detail that \ can't be the last character in a raw string - which should get fixed also, IMO). So I think in Py3k, "\u1234" should continue to be a string with 6 characters. Otherwise, people will complain that os.stat(r"c:\windows\system32\user32.dll") fails. Telling them to write os.stat(r"c:\windows\system32\u005Cuser32.dll") will just cause puzzled faces.
>
> Windows path names are one of the two primary applications of raw strings (the other being regexes).

I think regular expressions become easier to read if they don't also contain python escape characters, because then you don't have to mentally parse which ones are part of the regular expression and which ones are evaluated by python. The re module can still evaluate r"\uxxxx", r"\'", and r'\"' sequences even if python doesn't.
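As a rough illustration of that division of labour, here is a minimal sketch, assuming a current Python 3 interpreter (3.3 or later for the \u part). The variable names are made up for the example. The raw literal hands the backslash to re untouched, and re interprets it itself.

```python
import re

# The raw literal passes both characters (backslash, quote) through unchanged.
pattern_text = r'\"'
quote_re = re.compile(pattern_text)   # re itself reads \" as a literal quote

# The raw literal below is six characters long; re decodes \uXXXX on its own.
snowman_text = r"\u2603"
snowman_re = re.compile(snowman_text)

print(len(pattern_text), bool(quote_re.fullmatch('"')))
print(len(snowman_text), bool(snowman_re.fullmatch("\u2603")))
```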

I experimented with tokenizer.c to see if the trailing '\' could be special-cased in raw strings. The smallest change I could come up with was to have it not respect backslash-quote sequences (for finding the end of a string) if the quote is the same type as the quote used to define the string. The following strings in the library needed to be adjusted after that change.
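For comparison, this is how an unmodified interpreter treats backslash-quote in raw strings today. It is only a sketch of the existing behaviour, not of the experimental tokenizer change, and the helper name `parses` is invented for the example.

```python
def parses(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True

# Source text r"\" : the backslash keeps the quote from closing the string.
trailing_ok = parses('r"\\"')

# Source text r"\"" : accepted, and the backslash stays in the value.
escaped_quote = eval('r"\\""')

# A common workaround: append the backslash as a separate, non-raw literal.
with_trailing = eval('r"c:\\windows" "\\\\"')

print(trailing_ok, repr(escaped_quote), repr(with_trailing))
```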

I don't think this is the best solution, but the list of strings that needed changing might be useful for the discussion.

-_declstringlit_match = re.compile(r'(\'[^\']*\'|"[^"]*")\s*').match
+_declstringlit_match = re.compile(r'''('[^']*'|"[^"]*")\s*''').match

end-of-quote

-HEADER_QUOTED_VALUE_RE = re.compile(r"^\s*=\s*\"([^\"\\]*(?:\\.[^\"\\]*)*)\"")
+HEADER_QUOTED_VALUE_RE = re.compile(r'''^\s*=\s*"([^"\\]*(?:\\.[^"\\]*)*)"''')

-HEADER_JOIN_ESCAPE_RE = re.compile(r"([\"\\])")
+HEADER_JOIN_ESCAPE_RE = re.compile(r'(["\\])')

'unicode-escape')
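The rewritten patterns are meant to match the same text as the originals. A quick way to check that on an unmodified interpreter, where both spellings are still valid, is sketched below. It assumes the patterns read as they do in the standard library; the short names, the helper `same_result` and the sample strings are made up for the example.

```python
import re

# Old spelling: quotes escaped with a backslash inside the raw string.
# New spelling: a different (or triple) quote delimiter, no escaped quotes.
old_lit = re.compile(r'(\'[^\']*\'|"[^"]*")\s*')
new_lit = re.compile(r'''('[^']*'|"[^"]*")\s*''')
old_join = re.compile(r"([\"\\])")
new_join = re.compile(r'(["\\])')

def same_result(old, new, text):
    """True if both patterns find the same first match (or none) in text."""
    a, b = old.search(text), new.search(text)
    return (a and a.group(0)) == (b and b.group(0))

samples = ["'abc'  rest", '"x" y', 'say "hi"', "back\\slash", "plain"]
lit_equivalent = all(same_result(old_lit, new_lit, s) for s in samples)
join_equivalent = all(same_result(old_join, new_join, s) for s in samples)
print(lit_equivalent, join_equivalent)
```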

I also noticed that python handles the '\' escape character differently than re does in regular strings. In regular expressions, a single '\' is always an escape character. If the following character is not a special character, then the two-character combination becomes the second, non-special character.

 "\'"  --> '
 "\\"  --> \
 "\q"  --> q  ('q' not special so '\q' is 'q')

This isn't how python does it.

 >>> '\''
 "'"
 >>> "\\"
 '\\'
 >>> "\q"      ('q' not special, so Back slash is not an escape.)
 '\\q'

So it might be good to have it always be an escape in regular strings, and never be an escape in raw strings.
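The difference can be reproduced with a short script. This is a sketch assuming a current Python 3, where an unknown escape such as "\q" in a regular string literal still keeps its backslash but triggers a warning (silenced here). The helper name `literal` is invented for the example. Note that recent versions of re (3.6 and later) reject an unknown letter escape such as \q in a pattern instead of reading it as 'q', so the last check is left open-ended.

```python
import re
import warnings

def literal(source):
    """Evaluate a string literal, silencing the invalid-escape warning."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return eval(source)

# Python string literals: the backslash is dropped only for known escapes.
py_quote = literal(r'"\'"')      # one character: '
py_backslash = literal(r'"\\"')  # one character: \
py_unknown = literal(r'"\q"')    # two characters: \ and q

# re patterns: a backslash before punctuation always escapes it.
re_quote = re.fullmatch(r"\'", "'")
re_backslash = re.fullmatch(r"\\", "\\")

# Whether re accepts \q at all depends on the Python version.
try:
    re.compile(r"\q")
    re_accepts_q = True
except re.error:
    re_accepts_q = False

print(repr(py_quote), repr(py_backslash), repr(py_unknown), re_accepts_q)
```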

Ron


