[Python-3000] Unicode and OS strings
Stephen J. Turnbull stephen at xemacs.org
Thu Sep 13 20:43:59 CEST 2007
"Martin v. Löwis" writes:
One "universal" solution is to use Unicode private-use-area characters.
+1
> Of course, if the input data already contains PUA characters, there would be an ambiguity.
That may be true in the implementation, but it shouldn't be. What should happen internally is that all undecodable characters (and for the standard codecs, PUA characters are undecodable by definition) are mapped to unused code points in the PUA, chosen by Python.
This map would be required to maintain some house-keeping information about where each character came from (specifically, the original coded character set), so that round-tripping would succeed.
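To make that concrete, here is a minimal sketch of the kind of house-keeping map I mean; the PUA range (U+E000 upward) and all of the names are assumptions for illustration, not a proposal for the actual implementation:

    class PUAMap:
        """Assign otherwise-unused private-use code points to undecodable
        characters, remembering the source coded character set and the raw
        code units so the original can be recovered on re-encoding."""

        def __init__(self, first=0xE000, last=0xF6FF):
            self._next = first      # next unassigned PUA code point
            self._last = last
            self._by_key = {}       # (charset, raw bytes) -> PUA code point
            self._by_cp = {}        # PUA code point -> (charset, raw bytes)

        def mark(self, charset, raw):
            """Return a one-character str standing in for an undecodable unit."""
            key = (charset, bytes(raw))
            cp = self._by_key.get(key)
            if cp is None:
                if self._next > self._last:
                    raise RuntimeError("private-use area exhausted")
                cp = self._next
                self._next += 1
                self._by_key[key] = cp
                self._by_cp[cp] = key
            return chr(cp)

        def restore(self, ch):
            """Return the (charset, raw bytes) recorded for a PUA character."""
            return self._by_cp[ord(ch)]

On re-encoding, a codec that sees a character in this range would call restore() and emit the recorded code units; that is what makes the round trip work.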
One possible error-recovery strategy for broken encodings (as opposed to coded text that is correct in form but contains a code point that isn't in the table) would be to have a "pure code unit" block in the PUA.
Note that since we're talking about code units throughout (there's no guarantee that the encoding in question is octet-oriented, although that's almost always the case in practice), 256 code points may not be enough.
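For the common octet-oriented case, the "pure code unit" block could be as simple as 256 consecutive PUA code points plus a decode error handler that maps each unconsumed octet into it. The block start (U+F700) and the handler name below are arbitrary choices for this sketch, and an encoding with wider code units would need a correspondingly larger block:

    import codecs

    # Illustrative "pure code unit" block: U+F700..U+F7FF, one code point
    # per possible octet value.  (An assumption for this sketch only.)
    CU_BASE = 0xF700

    def pua_codeunit_errors(exc):
        """Decode error handler: map each undecodable octet to CU_BASE + octet."""
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        bad = exc.object[exc.start:exc.end]
        return ''.join(chr(CU_BASE + b) for b in bad), exc.end

    codecs.register_error('pua-codeunit', pua_codeunit_errors)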
> We would make a list of all interfaces that use the PUA error handler: file names, environment variables, command line arguments.
In general, I don't consider this an error. It's reasonable to use exception handling internal to the codec -- such broken texts are rare except in interactive applications, where speed isn't an issue -- but for some applications it would be useful to accept entire broken strings and pass them to Python with the broken parts marked (i.e., assigned to the "code unit" block of the PUA) and the rest decoded.
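Assuming the 'pua-codeunit' handler sketched above has been registered, accepting a whole broken string is then just an ordinary decode call; only the broken spans land in the code unit block (the byte string and the expected output here are illustrative):

    # UTF-8 for "日本" with the last octet of the second character cut off.
    raw = "日本".encode("utf-8")[:-1]

    text = raw.decode("utf-8", errors="pua-codeunit")
    print([hex(ord(c)) for c in text])
    # ['0x65e5', '0xf7e6', '0xf79c'] -- the first character decodes
    # normally; the two orphaned trailing octets are marked in the CU block.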
Here's an example that comes up in Emacs (specifically AUCTeX). TeX error messages are octet-oriented and regularly slice multibyte encodings in the middle of characters or escape sequences. It turns out the basic codec algorithms often DTRT by (accidentally) resynchronizing on ASCII, and can sometimes even resync on a multibyte character, so the display of the "broken" text is often useful. However, for reasons I'm not familiar with, the AUCTeX developers have asked that the strings be invertible (i.e., back to the octets that TeX spat out). This scheme would allow that.
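A hedged sketch of that inversion, reusing the code unit block from above with UTF-8 as the nominal coding system (the names are again illustrative, and the round trip is only exact when the correctly decoded spans re-encode to the octets they came from, as they do for UTF-8):

    CU_BASE = 0xF700

    def to_octets(text, encoding="utf-8"):
        """Invert the decode: CU-block characters become their original
        octet; everything else is re-encoded with the nominal encoding."""
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if CU_BASE <= cp <= CU_BASE + 0xFF:
                out.append(cp - CU_BASE)
            else:
                out.extend(ch.encode(encoding))
        return bytes(out)

    # With `raw` and `text` from the previous example:
    #     to_octets(text) == raw   ->  True
    # so the displayed (partly broken) text can still be handed back as
    # exactly the octets TeX emitted.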