[Python-3000] Unicode and OS strings

Stephen J. Turnbull stephen at xemacs.org
Tue Sep 18 06:08:29 CEST 2007


"Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> writes:

> > > Well, if any scheme which attempts to modify UTF-8 by accepting arbitrary byte strings is used, something must be interpreted differently than in real UTF-8.
> >
> > Wrong. In my scheme everything ends up in the PUA, on which real UTF-8 imposes no interpretation by definition.
>
> This is wrong: UTF-8 is specified for PUA. PUA is not special from the point of view of UTF-8.

It is from the point of view of the Unicode standard, specifically v5. Please see section 16.5, especially about the "corporate use subarea". (No, I hadn't considered this stuff yet in my proposal, but it's not hard to accommodate.)

> UTF-8 is defined for all Unicode scalar values,

Sure, and what I propose is entirely compatible with the specification of UTF-8 as a UTF, unlike what you propose. Until you understand why that's true, we're at an impasse.
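
The uncontested part of this exchange can be checked in a few lines of Python 3: a private-use code point is an ordinary scalar value as far as UTF-8 is concerned, and it round-trips as a well-formed three-byte sequence. This is only an illustration; the choice of U+E000 (the first code point of the BMP Private Use Area) is arbitrary.

```python
import unicodedata

# U+E000 is the first code point of the BMP Private Use Area.
pua_char = "\ue000"

# UTF-8 encodes it like any other scalar value in the range U+0800..U+FFFF.
encoded = pua_char.encode("utf-8")
decoded = encoded.decode("utf-8")

# The general category "Co" is what marks the code point as private use.
category = unicodedata.category(pua_char)

print(encoded, decoded == pua_char, category)
```

What the encoding form cannot say is what the character means; that is left to private agreement, which is the point being argued here.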

> > I haven't gone back to check yet, but it's possible that a "real UTF-8 conforming process" is required to stop processing and issue an error or something like that in the cases we're trying to handle.
>
> "C10. When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters."

Yeah, that's the one.
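
For reference, this is roughly what the "error condition" reading of C10 looks like with a strict decoder in Python 3. The helper name `strict_decode` is made up for this sketch; the only library behavior relied on is that `bytes.decode` with `errors="strict"` raises `UnicodeDecodeError` on an ill-formed sequence.

```python
def strict_decode(raw: bytes):
    """Decode as UTF-8; on ill-formed input, report where decoding failed.

    Returns (text, None) on success, or (None, (start, end)) giving the
    byte offsets of the ill-formed sequence instead of any characters.
    """
    try:
        return raw.decode("utf-8", errors="strict"), None
    except UnicodeDecodeError as exc:
        return None, (exc.start, exc.end)


# 0xFF can never appear in well-formed UTF-8, so this input is rejected.
text, error_span = strict_decode(b"abc\xff")
```

A process that stops here produces no characters at all for the bad bytes, which is exactly the case that OS strings (file names, environment variables) make awkward.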

While I'm uncomfortable advocating the position that my proposal is entirely compatible with C10, it is true that it treats ill-formed sequences as an error, and it is arguable that "mapping code units to characters in private space" is not the same as "interpreting them as characters". For obvious reasons I'm uncomfortable with that, but I actually don't consider this non-conformance a huge loss in the context of this thread since both your proposal and James Knight's do equally non-conformant things.
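
To make "mapping code units to characters in private space" concrete, here is a minimal sketch in Python 3. It is not the scheme proposed in this thread, whose details are not spelled out here; the base `PUA_BASE = 0xE000`, the handler name `"pua_escape"`, and the helper functions are all assumptions chosen for illustration. It uses `codecs.register_error`, so the decoder still detects each ill-formed sequence as an error and only then hands the offending bytes to the handler.

```python
import codecs

# Assumption for this sketch: stray bytes land at PUA_BASE + byte value.
PUA_BASE = 0xE000


def pua_escape(exc):
    """Error handler: map each undecodable byte to a private-use code point."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join(chr(PUA_BASE + b) for b in bad)
    # Resume decoding right after the ill-formed sequence.
    return replacement, exc.end


codecs.register_error("pua_escape", pua_escape)


def decode_os_string(raw: bytes) -> str:
    """Decode UTF-8, diverting ill-formed bytes into the PUA."""
    return raw.decode("utf-8", errors="pua_escape")


def encode_os_string(text: str) -> bytes:
    """Inverse mapping: turn the reserved PUA range back into raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)  # restore the original stray byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)


roundtrip = encode_os_string(decode_os_string(b"abc\xff"))
```

Note that the sketch has the collision discussed above: input that already contains a genuine character in U+E080..U+E0FF is encoded back as a single raw byte rather than as its three-byte UTF-8 form, so that range is interpreted differently than in plain UTF-8. Any real design has to decide which private-use range to reserve and what to do with existing uses of it.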


