[Python-3000] Unicode and OS strings

Stephen J. Turnbull stephen at xemacs.org
Tue Sep 18 22:36:41 CEST 2007


"Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> writes:

> > > This is wrong: UTF-8 is specified for PUA. PUA is not special from the point of view of UTF-8.
> >
> > It is from the point of view of the Unicode standard, specifically v5. Please see section 16.5, especially about the "corporate use subarea".
>
> It is not. 16.5 doesn't say anything about UTF-8, and UTF-8 is already specified for PUA.

There's no UTF-8 in Python's internal string encoding. What are you talking about?

> > Sure, and what I propose is entirely compatible with the specification of UTF-8 as a UTF,
>
> It is not. In UTF-8 '\ue650' is b'\xEE\x99\x90'; in your proposal it might be encoded as a single byte.

Of course not; the point of the proposal is to ensure that all text can be round-tripped through Python's internal representation. Anything that comes in as a character through a codec using my exception handler will be the same character when output with that handler. Again, what are you talking about?
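The round-trip idea can be sketched with Python's codec error-handler machinery. To be clear, the handler name, the PUA offset, and the use of ASCII as the codec below are illustrative assumptions made for this sketch, not the mapping from the actual proposal:

```python
import codecs

# Illustrative assumption: smuggle each undecodable byte into the PUA
# at code point U+E000 + byte value. Offset and handler name are made
# up for this example.
PUA_BASE = 0xE000

def pua_roundtrip(exc):
    if isinstance(exc, UnicodeDecodeError):
        # On input, replace each undecodable byte with a PUA code point.
        bad = exc.object[exc.start:exc.end]
        return ''.join(chr(PUA_BASE + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # On output with the same handler, map the PUA code points
        # back to the original raw bytes.
        chunk = exc.object[exc.start:exc.end]
        if not all(PUA_BASE <= ord(c) <= PUA_BASE + 0xFF for c in chunk):
            raise exc  # not one of our smuggled bytes; give up
        return bytes(ord(c) - PUA_BASE for c in chunk), exc.end
    raise exc

codecs.register_error('pua-roundtrip', pua_roundtrip)

raw = b'abc\xff\xfe'                          # not valid ASCII
text = raw.decode('ascii', 'pua-roundtrip')   # bad bytes land in the PUA
assert text == 'abc\ue0ff\ue0fe'
assert text.encode('ascii', 'pua-roundtrip') == raw  # round-trips exactly
```

Note that the round trip holds only when the same handler is used on output: encoding the same string as plain UTF-8 would instead produce three bytes per PUA character, which is exactly the objection being argued about here.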

> > While I'm uncomfortable advocating the position that my proposal is entirely compatible with C10,
>
> It is not. Elements of PUA are characters.

Yes. Where did I say anything else?

> It's not the same, but interpreting as characters in PUA is obviously interpreting as characters.

No. Internally mapping to characters in PUA is mapping. Unicode does not try to restrict internal processing, only behavior at process boundaries. Interpretation as characters happens only on output.

I do not yet know how to prevent that (or even if I can, it may be practically impossible because of important cases where the internal representation is exchanged between processes). If it can't be prevented while maintaining efficiency, that is a major flaw (but not necessarily fatal, since I'm proposing an exception handler, not a required feature of Unicode codecs).

> I meant Python3 where sys.argv is a list of Unicode strings. It should work out of the box.

I really don't think so. If you expose internal representations, as you are doing here, that is your problem; it is not something that Python should attempt to guarantee will work.

More troublesome from your point of view, Guido has stated that the internal representation used by Python strings is a sequence of Unicode code units, not characters. I don't think that's reached the status of "pronouncement" yet, but you will probably need a PEP to get the guarantees you want.
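The code-unit versus character distinction is easy to see with a character outside the BMP. A small illustration (using modern Python, where PEP 393 later made strings code-point based, while UTF-16 output still shows the code-unit view that narrow builds of that era exposed directly):

```python
# One code point outside the BMP is two UTF-16 code units.
# On the narrow CPython builds common in 2007, len() reported the
# code-unit count (2); since PEP 393 (Python 3.3) it reports code points.
s = '\U00010400'                        # DESERET CAPITAL LETTER LONG I
assert len(s) == 1                      # code points (modern CPython)
assert len(s.encode('utf-16-le')) == 4  # two 16-bit code units: a surrogate pair
```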

> Why length 6? "\ue650" encoded in UTF-8 has length 3.

MS UTF-8, I suppose. You see, you simply cannot depend on any particular Python string being translated to a particular Unicode representation unless you choose the codec explicitly. Since you have to specify that codec to be reliable anyway, I don't see much loss here except to lazy programmers willing to live dangerously. But that's not true of anybody in this thread! The whole point is to preserve even broken input for later forensic analysis.
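The point that byte-level expectations are meaningful only once a codec has been chosen explicitly can be checked directly (a small illustration, not from the original thread):

```python
# The same one-character string produces different byte counts under
# different codecs, so any code that depends on a particular encoded
# length must name its codec explicitly.
s = '\ue650'                            # a PUA character, one code point
assert len(s) == 1
assert len(s.encode('utf-8')) == 3      # b'\xee\x99\x90'
assert len(s.encode('utf-16-le')) == 2  # one 16-bit code unit
assert len(s.encode('utf-32-le')) == 4
```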

> For an old discussion about using PUA to represent bytes undecodable as UTF-8, see http://www.mail-archive.com/unicode@unicode.org/ and subthreads with "roundtripping" in the subject.

Which (after a half hour of looking) are mostly irrelevant, because Mr. Kristan's proposal (I assume that's what you're talking about) as far as I can see involved standardizing such representations within Unicode. We're not talking about that here; we're talking about representations internal to Python, for the convenience of Python users.


