[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Tony Nelson tonynelson at georgeanelson.com
Mon Apr 27 20:08:51 CEST 2009


At 23:39 -0700 04/26/2009, Glenn Linderman wrote:

On approximately 4/25/2009 5:35 AM, came the following characters from the keyboard of Martin v. Löwis:

Because the encoding is not reliably reversible.

Why do you say that? The encoding is completely reversible (unless we disagree on what "reversible" means).

I'm +1 on the concept, -1 on the PEP, due solely to the lack of a reversible encoding.

Then please provide an example for a setup where it is not reversible.

Regards, Martin

It is reversible if you know that it is decoded, and apply the encoding. But if you don't know that has been encoded, then applying the reverse transform can convert an undecoded str that matches the decoded str to the form that it could have, but never did take.

The problem is that there is no guarantee that the str interface provides only strictly conforming Unicode, so decoding bytes to non-strictly conforming Unicode, can result in a data pun between non-strictly conforming Unicode coming from the str interface vs bytes being decoded to non-strictly conforming Unicode coming from the bytes interface.

...
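The data pun described above can be sketched in a few lines. This is only an illustration, and it assumes the `surrogateescape` error handler as it exists in current Python 3, which may differ in detail from the draft being discussed in this thread. The file name is made up.

```python
# Assumption: Python 3's "surrogateescape" error handler stands in for the
# PEP's proposed escaping; the byte string below is an invented example.
raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8

# Bytes interface: the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
decoded = raw.decode("utf-8", "surrogateescape")

# Knowing the string was produced this way, the round trip is exact.
round_trip = decoded.encode("utf-8", "surrogateescape")

# Str interface: a string that already contained U+DCE9 and was never bytes.
native = "caf\udce9.txt"

# The two strings compare equal, so nothing records which origin is real.
# Encoding the native string yields bytes it "could have, but never did take".
pun_bytes = native.encode("utf-8", "surrogateescape")

print(decoded == native, round_trip == raw, pun_bytes == raw)
```

Both strings are the same sequence of code points, so the encoder cannot tell an escaped byte from a lone surrogate that arrived through the str interface.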

Maybe this is a dumb idea, but some people might be reassured if the half-surrogates had some particular pattern that is unlikely to occur even in unreasonable text (as half-surrogates are an error in Unicode). The pattern could be some sequence of half-surrogate encoded bytes, framing the intended data, as is done for RFC 2047 internationalized header fields in email. It would take up a few more bytes in the string, but no matter. It would also make it easier to diagnose when decoding was not properly done.
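A rough sketch of the framing idea, under loud assumptions: the helper names `decode_framed` and `encode_framed` are hypothetical, the markers U+DC00 and U+DC01 are an arbitrary choice of half-surrogates outside the escape range, and Python 3's `surrogateescape` handler stands in for the PEP's escaping. None of this is part of the PEP.

```python
import re

# Hypothetical frame markers: low surrogates that the escape range
# U+DC80..U+DCFF never produces, so they cannot collide with escaped bytes.
OPEN, CLOSE = "\udc00", "\udc01"

_RUN = re.compile("[\udc80-\udcff]+")
_FRAMED = re.compile(OPEN + "([\udc80-\udcff]+)" + CLOSE)


def decode_framed(raw, encoding="utf-8"):
    """Decode bytes, wrapping each run of escaped bytes in OPEN ... CLOSE."""
    text = raw.decode(encoding, "surrogateescape")
    return _RUN.sub(lambda m: OPEN + m.group(0) + CLOSE, text)


def encode_framed(text, encoding="utf-8"):
    """Encode text, turning only framed runs back into their raw bytes."""
    parts = _FRAMED.split(text)
    out = []
    for index, part in enumerate(parts):
        if index % 2:
            # Odd positions hold the captured, framed runs of escaped bytes.
            out.append(part.encode(encoding, "surrogateescape"))
        else:
            # Unframed text is encoded strictly, so a stray half-surrogate
            # raises UnicodeEncodeError instead of silently becoming a byte.
            out.append(part.encode(encoding))
    return b"".join(out)


framed = decode_framed(b"caf\xe9.txt")
restored = encode_framed(framed)
```

With this framing, a string such as `"caf\udce9.txt"` that never came from bytes fails to encode rather than being mistaken for an escaped byte, at the cost of two extra code points per escaped run.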

FWIW, I like the idea in the PEP, now that I think I understand it.

(BTW, gotta love what the email package is doing to the Subject: header field. ;-')


TonyN.:' <mailto:tonynelson at georgeanelson.com> ' <http://www.georgeanelson.com/>


