[Python-Dev] \u and \U escapes in raw unicode string literals (original) (raw)

"Martin v. Löwis" martin at v.loewis.de
Sun May 13 18:04:44 CEST 2007


* without the Unicode escapes, the only way to put non-ASCII code points into a raw Unicode string is via a source code encoding of say UTF-8 or UTF-16, pretty much defeating the original requirement of writing ASCII code only

That's no problem, though - just don't put the Unicode character into a raw string. Use plain strings if you have a need to include Unicode characters, and are not willing to leave ASCII.

For Python 3, the default source encoding is UTF-8, so it is much easier to use non-ASCII characters in the source code. The original requirement may not be as strong anymore as it used to be.

* non-ASCII code points in text are not uncommon, they occur in most European scripts, all Asian scripts, many scientific texts and in also texts meant for the web (just have a look at the HTML entities, or think of Word exports using quotes)

And you are seriously telling me that people who commonly use non-ASCII code points in their source code are willing to refer to them by Unicode ordinal number (which, of course, they all know by heart, from 1 to 65536)?

* adding Unicode escapes to the re module will break code already using "...\u..." in the regular expressions for other purposes; writing conversion tools that detect this usage is going to be hard

It's unlikely to occur in code today - \u just means the same as u (so \u1234 matches u1234); if you want a backslash followed by u in your regular expression, you should write \u.

It would be possible to future-warn about \u in 2.6, catching these cases. Authors then would either have to remove the backslash, or duplicate it, depending on what they want to express.

Regards, Martin



More information about the Python-Dev mailing list