(original) (raw)

On Fri, Aug 26, 2011 at 5:59 AM, Guido van Rossum <guido@python.org> wrote:

On Thu, Aug 25, 2011 at 7:28 PM, Isaac Morland <ijmorlan@uwaterloo.ca> wrote:
> On Thu, 25 Aug 2011, Guido van Rossum wrote:
>
>> I'm not sure what should happen with UTF-8 when it (in flagrant
>> violation of the standard, I presume) contains two separately-encoded
>> surrogates forming a valid surrogate pair; probably whatever the UTF-8
>> codec does on a wide build today should be good enough.

Surrogates are used and valid only in UTF-16.
In UTF-8/32 they are invalid, even if they are in pair (see http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf ). Of course Python can/should be able to represent them internally regardless of the build type.

>>Similarly for
>> encoding to UTF-8 on a wide build if one managed to create a string
>> containing a surrogate pair. Basically, I'm for a
>> garbage-in-garbage-out approach (with separate library functions to
>> detect garbage if the app is worried about it).
>
> If it's called UTF-8, there is no decision to be taken as to decoder
> behaviour - any byte sequence not permitted by the Unicode standard must
> result in an error (although, of course, \*how\* the error is to be reported
> could legitimately be the subject of endless discussion).

What do you mean? We use the "strict" error handler by default and we can specify other handlers already.

There are
> security implications to violating the standard so this isn't just
> legalistic purity.

You have a point. The security issues cannot be seen separate from all
the other issues. The folks inside Google who care about Unicode often
harp on this. So I stand corrected. I am fine with codecs treating
code points or code point sequences that the Unicode standard doesn't
like (e.g. lone surrogates) the same way as more severe errors in the
encoded bytes (lots of byte sequences already aren't valid UTF-8).

Codecs that use the official names should stick to the standards. For example s.encode('utf-32') should either produce a valid utf-32 byte string or raise an error if 's' contains invalid characters (e.g. surrogates).
We can have other internal codecs that are based on the UTF-\* encodings but allow the representation of lone surrogates and even expose them if we want, but they should have a different name (even 'utf-\*-something' should be ok, see http://bugs.python.org/issue12729#msg142053 from "Unicode says you can't put surrogates or noncharacters in a UTF-anything stream.").

I
just hope this doesn't require normal forms or other expensive
operations; I hope it's limited to rejecting invalid use of surrogates
or other values that are not valid code points (e.g. 0, or >= 2\*\*21).

I think there shouldn't be any normalization done automatically by the codecs.

> Hmmm, doesn't look good:
>
> Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
> \[GCC 4.2.1 (Apple Inc. build 5646)\] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>>>>
>>>> '\\xed\\xb0\\x80'.decode ('utf-8')
>
> u'\\udc00'
>>>>
>
> Incorrect! Although this is a narrow build - I can't say what the wide
> build would do.

The UTF-8 codec used to follow RFC 2279 and only recently has been updated to RFC 3629 (see http://bugs.python.org/issue8271#msg107074 ). On Python 2.x it still produces invalid UTF-8 because changing it is backward incompatible. In Python 2 UTF-8 can be used to encode every codepoint from 0 to 10FFFF, and it always works. If we change it now it might start raising errors for an operation that never raised them before (see http://bugs.python.org/issue12729#msg142047 ).
Luckily this is fixed in Python 3.x.
I think there are more codepoints/byte sequences that should be rejected while encoding/decoding though, in both UTF-8 and UTF-16/32, but I haven't looked at them yet (I would be happy to fix these for 3.3 or even 2.7/3.2 (if applicable), so if you find mismatches with the Unicode standard and report an issue, feel free to assign it to me).

Best Regards,
Ezio Melotti

>
> For reasons of practicality, it may be appropriate to provide easy access to
> a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must not be
> called UTF-8\. Other variations may also find use if provided.
>
> See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt
>
> And CESU-8 technical report: http://www.unicode.org/reports/tr26/

Thanks for the links! I also like the term "supplemental character" (a
code point >= 2\*\*16). And I note that they talk about characters were
we've just agreed that we should say code points...

\--
\--Guido van Rossum (python.org/\~guido)

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_