[Python-Dev] PEP 393 Summer of Code Project

Ezio Melotti ezio.melotti at gmail.com
Fri Aug 26 11:14:13 CEST 2011

On Fri, Aug 26, 2011 at 5:59 AM, Guido van Rossum <guido at python.org> wrote:

On Thu, Aug 25, 2011 at 7:28 PM, Isaac Morland <ijmorlan at uwaterloo.ca> wrote:
> On Thu, 25 Aug 2011, Guido van Rossum wrote:
>
>> I'm not sure what should happen with UTF-8 when it (in flagrant
>> violation of the standard, I presume) contains two separately-encoded
>> surrogates forming a valid surrogate pair; probably whatever the UTF-8
>> codec does on a wide build today should be good enough.

Surrogates are used and valid only in UTF-16. In UTF-8/32 they are invalid, even when they come in pairs (see http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf ). Of course Python can/should be able to represent them internally regardless of the build type.
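
To make the distinction concrete, here is a small sketch, assuming a recent Python 3 interpreter (the variable names are only for illustration). It shows a supplementary code point encoded with a surrogate pair in UTF-16 and without one in UTF-8, and a lone surrogate that is representable as a str but rejected by the strict UTF-8 codec:

```python
# A supplementary code point (outside the BMP), chosen for illustration.
ch = "\U00010000"

# UTF-16 represents it with a surrogate pair: D800 DC00.
utf16 = ch.encode("utf-16-be")
print(utf16)   # b'\xd8\x00\xdc\x00'

# UTF-8 encodes the code point directly in four bytes; no surrogates appear.
utf8 = ch.encode("utf-8")
print(utf8)    # b'\xf0\x90\x80\x80'

# A lone surrogate can still be held in a str internally...
lone = "\udc00"
print(len(lone))  # 1

# ...but the strict UTF-8 codec refuses to encode it.
try:
    lone.encode("utf-8")
    encoded_ok = True
except UnicodeEncodeError:
    encoded_ok = False
print(encoded_ok)  # False
```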

>> Similarly for
>> encoding to UTF-8 on a wide build if one managed to create a string
>> containing a surrogate pair. Basically, I'm for a
>> garbage-in-garbage-out approach (with separate library functions to
>> detect garbage if the app is worried about it).
>
> If it's called UTF-8, there is no decision to be taken as to decoder
> behaviour - any byte sequence not permitted by the Unicode standard must
> result in an error (although, of course, how the error is to be reported
> could legitimately be the subject of endless discussion).

What do you mean? We use the "strict" error handler by default and we can specify other handlers already.
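
For reference, a minimal sketch of what I mean by error handlers, assuming Python 3 (the sample bytes and variable names are made up for the example). The default handler is "strict", and a different one can be passed per call:

```python
# 'a', then an encoded lone surrogate (invalid in UTF-8), then 'b'.
data = b"a\xed\xb0\x80b"

# The default handler is "strict": invalid input raises UnicodeDecodeError.
try:
    data.decode("utf-8")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

# Other handlers can be selected explicitly.
replaced = data.decode("utf-8", "replace")  # invalid bytes become U+FFFD
ignored = data.decode("utf-8", "ignore")    # invalid bytes are dropped

print(strict_raised)    # True
print(ascii(replaced))
print(ignored)          # ab
```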

> There are
> security implications to violating the standard so this isn't just
> legalistic purity.

You have a point. The security issues cannot be considered separately from all the other issues. The folks inside Google who care about Unicode often harp on this. So I stand corrected. I am fine with codecs treating code points or code point sequences that the Unicode standard doesn't like (e.g. lone surrogates) the same way as more severe errors in the encoded bytes (lots of byte sequences already aren't valid UTF-8).

Codecs that use the official names should stick to the standards. For example, s.encode('utf-32') should either produce a valid UTF-32 byte string or raise an error if 's' contains invalid characters (e.g. surrogates). We can have other internal codecs that are based on the UTF-* encodings but allow the representation of lone surrogates, and even expose them if we want, but they should have a different name (even 'utf-*-something' should be OK; see http://bugs.python.org/issue12729#msg142053, starting from "Unicode says you can't put surrogates or noncharacters in a UTF-anything stream.").
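
A rough sketch of the idea, assuming a recent Python 3 where the strict UTF-8 and UTF-32 encoders reject surrogates. The helper names (encodes_strictly, encode_utf8_lenient, decode_utf8_lenient) are hypothetical; they use the existing "surrogatepass" error handler to stand in for a separately named lenient codec and are not a proposal for the actual implementation:

```python
s = "abc\ud800"  # contains a lone surrogate


def encodes_strictly(text, encoding):
    """Return True if `text` encodes without error under the strict handler."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical lenient helpers, deliberately NOT called "utf-8".
def encode_utf8_lenient(text):
    # "surrogatepass" lets surrogate code points through the UTF-8 encoder.
    return text.encode("utf-8", "surrogatepass")


def decode_utf8_lenient(data):
    # The same handler accepts the encoded surrogates when decoding.
    return data.decode("utf-8", "surrogatepass")


print(encodes_strictly(s, "utf-8"))    # False
print(encodes_strictly(s, "utf-32"))   # False
raw = encode_utf8_lenient(s)
print(raw)                             # b'abc\xed\xa0\x80'
print(decode_utf8_lenient(raw) == s)   # True
```

The officially named codecs stay strict, and anything that round-trips lone surrogates lives under another name.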

I just hope this doesn't require normal forms or other expensive operations; I hope it's limited to rejecting invalid use of surrogates or other values that are not valid code points (e.g. 0, or >= 2**21).

I think there shouldn't be any normalization done automatically by the codecs.
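
As an illustration (a sketch assuming Python 3; the variable names are mine), the codec encodes exactly the code points it is given, and normalization remains a separate, explicit step done with unicodedata:

```python
import unicodedata

composed = "\u00e9"      # e-acute as a single code point (NFC form)
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT (NFD form)

# The codec does not normalize: the two strings encode to different bytes.
print(composed.encode("utf-8"))    # b'\xc3\xa9'
print(decomposed.encode("utf-8"))  # b'e\xcc\x81'

# Normalization is requested explicitly by the application when it wants it.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)      # True
```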

> Hmmm, doesn't look good:
>
> Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> '\xed\xb0\x80'.decode ('utf-8')
> u'\udc00'
> >>>
>
> Incorrect! Although this is a narrow build - I can't say what the wide
> build would do.

The UTF-8 codec used to follow RFC 2279 and was only recently updated to RFC 3629 (see http://bugs.python.org/issue8271#msg107074 ). On Python 2.x it still produces invalid UTF-8 because changing it would be backward incompatible. In Python 2, UTF-8 can be used to encode every code point from U+0000 to U+10FFFF, and it always works. If we change it now it might start raising errors for an operation that never raised them before (see http://bugs.python.org/issue12729#msg142047 ). Luckily this is fixed in Python 3.x. I think there are more code points/byte sequences that should be rejected while encoding/decoding though, in both UTF-8 and UTF-16/32, but I haven't looked at them yet (I would be happy to fix these for 3.3 or even 2.7/3.2 (if applicable), so if you find mismatches with the Unicode standard and report an issue, feel free to assign it to me).
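
A quick way to check the Python 3 behaviour (a sketch, assuming a recent Python 3; the names are illustrative): decode the bytes from the example above with the strict codec, then count which code points in U+0000..U+10FFFF the strict UTF-8 encoder rejects.

```python
# The bytes from the example above: an encoded lone surrogate U+DC00.
try:
    b"\xed\xb0\x80".decode("utf-8")
    decoded = True
except UnicodeDecodeError:
    decoded = False

# Collect every code point that the strict UTF-8 encoder refuses.
rejected = []
for cp in range(0x110000):
    try:
        chr(cp).encode("utf-8")
    except UnicodeEncodeError:
        rejected.append(cp)

print(decoded)                                            # False
print(len(rejected), hex(rejected[0]), hex(rejected[-1])) # 2048 0xd800 0xdfff
```

Only the surrogate range is rejected; everything else in the code space encodes.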

Best Regards,
Ezio Melotti

> For reasons of practicality, it may be appropriate to provide easy access to
> a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must not be
> called UTF-8. Other variations may also find use if provided.
>
> See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt
>
> And CESU-8 technical report: http://www.unicode.org/reports/tr26/

Thanks for the links! I also like the term "supplemental character" (a code point >= 2**16). And I note that they talk about characters where we've just agreed that we should say code points...

--
--Guido van Rossum (python.org/~guido)
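
For the CESU-8 idea quoted above, a rough approximation can be put together from existing pieces. This is only a sketch assuming a recent Python 3: decode_cesu8_sketch is a hypothetical helper, not a stdlib codec, and it is more lenient than real CESU-8 because it also accepts ordinary 4-byte UTF-8 sequences.

```python
def decode_cesu8_sketch(data):
    """Approximate CESU-8 decoding (hypothetical helper, not a stdlib codec)."""
    # Step 1: decode, letting encoded surrogates through as surrogate code points.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Step 2: round-trip through UTF-16 so that surrogate pairs are combined;
    # an unpaired surrogate makes the strict UTF-16 decoder raise an error.
    as_utf16 = with_surrogates.encode("utf-16-le", "surrogatepass")
    return as_utf16.decode("utf-16-le")


# U+10000 written as two 3-byte sequences, one per surrogate (CESU-8 style).
cesu = b"\xed\xa0\x80\xed\xb0\x80"
print(decode_cesu8_sketch(cesu) == "\U00010000")  # True
```

Keeping this under its own name leaves the codec called "utf-8" strict.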
