[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces (original) (raw)

"Martin v. Löwis" martin at v.loewis.de
Sat Apr 25 17:05:23 CEST 2009


The only drawback I can see is if the UTF-8 bytes actually decode to a half surrogate. However, half surrogates should really only occur in UTF-16 (as I understand it), so they shouldn't be encoded in UTF-8 anyway!

Right: that's the rationale for UTF-8b. Encoding half surrogates violates parts of the Unicode spec, so UTF-8b is "safe".

As for handling this case, you could either:

1. Raise an exception (which is what you're trying to avoid) or: 2. Treat it as invalid UTF-8 and map the bytes to half surrogates (encoding would produce the original bytes). I'd prefer option 2.

I hadn't thought of this case, but you are right - they are illegal bytes, after all. Raising an exception would be useless since the whole point of this codec is to never raise unicode errors.

Regards, Martin



More information about the Python-Dev mailing list