[Python-3000] Unicode and OS strings

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Mon Sep 17 21:12:00 CEST 2007


On Sat, 15-09-2007 at 09:13 +0900, Stephen J. Turnbull wrote:

> > Well, for any scheme which attempts to modify UTF-8 by accepting
> > arbitrary byte strings is used, something must be interpreted
> > differently than in real UTF-8.

> Wrong. In my scheme everything ends up in the PUA, on which real UTF-8 imposes no interpretation by definition.

This is wrong: UTF-8 is specified for the PUA. The PUA is not special from the point of view of UTF-8. UTF-8 is defined for all Unicode scalar values, i.e. all code points in the ranges U+0000..U+D7FF and U+E000..U+10FFFF, which is all code points excluding surrogates. This includes the PUA.
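
A minimal sketch of the point above, assuming a current CPython 3 interpreter (the codec at the time of this thread may have behaved differently). The sample code points and the helper names `utf8_round_trips` and `utf8_encodable` are chosen only for illustration. It checks that Private Use Area code points encode and decode in UTF-8 like any other scalar value, while a lone surrogate is refused:

```python
# Code points from the three Private Use Areas (BMP, plane 15, plane 16).
pua_samples = ["\ue000", "\uf8ff", "\U000f0000", "\U0010fffd"]


def utf8_round_trips(ch):
    """Return True if ch survives encoding to UTF-8 and decoding back."""
    return ch.encode("utf-8").decode("utf-8") == ch


def utf8_encodable(ch):
    """Return True if the strict UTF-8 codec accepts ch."""
    try:
        ch.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# PUA code points are ordinary scalar values as far as UTF-8 is concerned.
pua_ok = all(utf8_round_trips(ch) for ch in pua_samples)

# U+D800 is a surrogate, not a scalar value, so strict UTF-8 rejects it.
surrogate_ok = utf8_encodable("\ud800")

print(pua_ok, surrogate_ok)
print("\ue000".encode("utf-8"))
```

So a scheme that stores undecodable bytes in the PUA collides with text that legitimately contains those PUA characters; UTF-8 itself gives them no special status.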

> I haven't gone back to check yet, but it's possible that a "real UTF-8 conforming process" is required to stop processing and issue an error or something like that in the cases we're trying to handle.

"C10. When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters."
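
As a rough illustration of what C10 asks for, here is a sketch assuming a current CPython 3 interpreter, whose strict UTF-8 decoder raises an error rather than interpreting ill-formed input as characters. The byte strings and the helper name `is_well_formed_utf8` are examples chosen here, not taken from the thread:

```python
# A few ill-formed code unit sequences.
ill_formed = [
    b"\xff",          # 0xFF never occurs in UTF-8
    b"\xc0\x80",      # overlong encoding of U+0000
    b"\xed\xa0\x80",  # UTF-8-style encoding of the surrogate U+D800
    b"\xe2\x82",      # truncated three-byte sequence
]


def is_well_formed_utf8(data):
    """Return True if data decodes under the strict UTF-8 codec."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# Each ill-formed sequence is treated as an error condition.
results = [is_well_formed_utf8(b) for b in ill_formed]
print(results)

# By contrast, the encoding of the PUA code point U+E000 is well-formed.
print(is_well_formed_utf8(b"\xee\x80\x80"))
```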

-- 
_("< Marcin Kowalczyk
_/ qrczak at knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/


