[Python-3000] Unicode and OS strings
Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Tue Sep 18 11:12:19 CEST 2007
On Tue, 18-09-2007 at 13:08 +0900, Stephen J. Turnbull wrote:
> > This is wrong: UTF-8 is specified for PUA. PUA is not special from the point of view of UTF-8.
> It is from the point of view of the Unicode standard, specifically v5. Please see section 16.5, especially about the "corporate use subarea".
It is not. 16.5 doesn't say anything about UTF-8, and UTF-8 is already specified for PUA.
> > UTF-8 is defined for all Unicode scalar values,
> Sure, and what I propose is entirely compatible with the specification of UTF-8 as a UTF,
It is not. In UTF-8 '\ue650' is b'\xEE\x99\x90', whereas in your proposal it might be encoded as a single byte.
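As a quick check of the byte sequence cited above, here is a minimal Python 3 sketch (an editorial illustration, not part of the original message) that encodes U+E650 with the standard UTF-8 codec. The variable names are illustrative only.

```python
# U+E650 lies in the BMP Private Use Area (U+E000..U+F8FF).
ch = "\ue650"

# Standard UTF-8 encodes any BMP code point at or above U+0800 as three bytes.
encoded = ch.encode("utf-8")
print(encoded)       # b'\xee\x99\x90'
print(len(encoded))  # 3

# Decoding the same bytes gives back the original code point.
roundtrip = encoded.decode("utf-8")
print(roundtrip == ch)
```

An encoder that emitted a single byte for this code point would therefore disagree with the standard UTF-8 codec.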
> > "C10. When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters."
> Yeah, that's the one. While I'm uncomfortable advocating the position that my proposal is entirely compatible with C10,
It is not. Elements of PUA are characters.
> it is arguable that "mapping code units to characters in private space" is not the same as "interpreting them as characters".
It's not the same, but interpreting as characters in PUA is obviously interpreting as characters.
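The two points at issue can be illustrated with a short Python 3 sketch (an editorial illustration, not from the thread). It shows that the standard `unicodedata` module reports a private-use code point with general category `Co`, and that a strict UTF-8 decoder treats an ill-formed sequence as an error, in the spirit of C10. The byte 0xFF is just one convenient ill-formed input.

```python
import unicodedata

# Private-use code points carry the general category "Co".
category = unicodedata.category("\ue650")
print(category)  # Co

# 0xFF can never occur in well-formed UTF-8, so strict decoding must fail.
ill_formed = b"\xff"
try:
    ill_formed.decode("utf-8")  # the default error handler is "strict"
    raised = False
except UnicodeDecodeError:
    raised = True
print(raised)  # True
```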
> chibi:MacPorts steve$ python -c 'import sys; print("%x" % ord(sys.argv[1]))' $(printf "\ue650")
> Traceback (most recent call last):
>   File "<string>", line 1, in ?
> TypeError: ord() expected a character, but string of length 6 found
I meant Python3 where sys.argv is a list of Unicode strings. It should work out of the box.
Why length 6? "\ue650" encoded in UTF-8 has length 3.
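The two lengths can be compared directly in Python 3. The sketch below assumes that the shell's `printf` left the `\u` escape uninterpreted and passed the six literal characters through. That is only a guess, which the thread does not confirm.

```python
# Length of U+E650 once encoded as UTF-8: three bytes.
as_utf8 = "\ue650".encode("utf-8")
print(len(as_utf8))  # 3

# Length of the uninterpreted escape text: backslash, 'u', and four hex digits.
# Assumption: this literal text is what reached sys.argv[1] in the session above.
literal = r"\ue650"
print(len(literal))  # 6
```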
For an old discussion about using PUA to represent bytes undecodable as UTF-8, see http://www.mail-archive.com/unicode@unicode.org/ and subthreads with "roundtripping" in the subject.
-- 
Marcin Kowalczyk
qrczak at knm.org.pl
http://qrnik.knm.org.pl/~qrczak/