[Python-Dev] \u and \U escapes in raw unicode string literals

M.-A. Lemburg mal at egenix.com
Sun May 13 22:54:48 CEST 2007

On 2007-05-13 18:04, Martin v. Löwis wrote:

>> * without the Unicode escapes, the only way to put non-ASCII code points into a raw Unicode string is via a source code encoding of say UTF-8 or UTF-16, pretty much defeating the original requirement of writing ASCII code only
>
> That's no problem, though - just don't put the Unicode character into a raw string. Use plain strings if you have a need to include Unicode characters, and are not willing to leave ASCII.
>
> For Python 3, the default source encoding is UTF-8, so it is much easier to use non-ASCII characters in the source code. The original requirement may not be as strong anymore as it used to be.

You can do that today: Just put the "# coding: utf-8" marker at the top of the file.

However, in some cases, your editor may not be capable of displaying or letting you enter the Unicode text you have in mind.

In other cases, there may be a corporate coding standard in place that prohibits using non-ASCII text in source code, or fixes the encoding to e.g. Latin-1.

In all those cases, you need some other way to enter the Unicode code points that cannot be written directly in the source code, and the easiest way to do this is by using Unicode escapes.
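As a minimal sketch of what that looks like in an ASCII-only source file: the snippet below assumes Python 3, where every plain string literal is Unicode (in the Python 2 releases this thread is about, the literals would need a u prefix). The euro sign and the variable names are chosen purely for illustration.

```python
import unicodedata

# ASCII-only source: each line produces U+20AC EURO SIGN without typing it.
by_escape = "\u20ac"          # four-digit Unicode escape
by_name = "\N{EURO SIGN}"     # named escape, no ordinal to remember
by_ordinal = chr(0x20AC)      # built from the code point at run time

# Escaped text that arrives as ASCII bytes can be decoded with the codec.
decoded = b"price: \\u20ac 5".decode("unicode_escape")

# Encoding in the other direction yields an ASCII-safe representation.
ascii_form = "price: \u20ac 5".encode("unicode_escape")

print(unicodedata.name(by_escape), ascii_form)
```

The named form also softens the objection that nobody knows the ordinals by heart, at the cost of longer literals.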

>> * non-ASCII code points in text are not uncommon, they occur in most European scripts, all Asian scripts, many scientific texts and also in texts meant for the web (just have a look at the HTML entities, or think of Word exports using quotes)
>
> And you are seriously telling me that people who commonly use non-ASCII code points in their source code are willing to refer to them by Unicode ordinal number (which, of course, they all know by heart, from 1 to 65536)?

No, I'm not. I'm saying that non-ASCII code points are in common use and (together with the above bullet) that there are situations where you can't put the relevant code point directly into your source code.

Using Unicode escapes for these will always be a kludge, but it's still better than not being able to enter the code points at all.

>> * adding Unicode escapes to the re module will break code already using "...\u..." in the regular expressions for other purposes; writing conversion tools that detect this usage is going to be hard
>
> It's unlikely to occur in code today - \u just means the same as u (so \u1234 matches u1234); if you want a backslash followed by u in your regular expression, you should write \\u.
>
> It would be possible to future-warn about \u in 2.6, catching these cases. Authors then would either have to remove the backslash, or duplicate it, depending on what they want to express.

Good idea.

The re module would then have to implement the same escaping scheme as the raw-unicode-escape code (only an odd number of backslashes causes the escaping code to trigger).
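The odd-number-of-backslashes rule can be sketched in a few lines. The helper below, expand_u_escapes, is hypothetical and is not how the re module or the codec is implemented; it only handles four-digit \u escapes (no \U, no error handling for truncated escapes) and is written for Python 3, where raw literals no longer interpret \u themselves. It is compared against the built-in raw_unicode_escape codec as a sanity check.

```python
import re

# Hypothetical helper, not part of the re module: a run of backslashes
# directly followed by "u" and four hex digits.
_U_ESCAPE = re.compile(r"(\\+)u([0-9a-fA-F]{4})")

def expand_u_escapes(text):
    def replace(match):
        backslashes, hex_digits = match.groups()
        if len(backslashes) % 2 == 0:
            # Even run: the backslashes pair up, so the "u" stays literal.
            return match.group(0)
        # Odd run: the last backslash introduces the escape; keep the rest.
        return backslashes[:-1] + chr(int(hex_digits, 16))
    return _U_ESCAPE.sub(replace, text)

one = expand_u_escapes(r"\u20ac")      # one backslash: escape is expanded
two = expand_u_escapes(r"\\u20ac")     # two backslashes: left untouched
three = expand_u_escapes(r"\\\u20ac")  # three: two kept, escape expanded

# The built-in codec applies the same rule when decoding ASCII bytes.
codec_one = rb"\u20ac".decode("raw_unicode_escape")
codec_two = rb"\\u20ac".decode("raw_unicode_escape")
```

Note that an even run is returned unchanged rather than collapsed: the rule only decides whether the escape fires, and leaves the interpretation of doubled backslashes to a later stage such as the regular expression parser.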

-- Marc-Andre Lemburg eGenix.com

