[Python-Dev] Multilingual programming article on the Red Hat Developer blog

R. David Murray rdmurray at bitdance.com
Tue Sep 16 17:00:32 CEST 2014


On Tue, 16 Sep 2014 13:51:23 +1000, Chris Angelico <rosuav at gmail.com> wrote:

On Tue, Sep 16, 2014 at 1:34 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Jim J. Jewett writes:
>
> > In terms of best-effort, it is reasonable to treat the smuggled bytes
> > as representing a character outside of your unicode repertoire
>
> I have to disagree. If you ever end up passing them to something that
> validates or tries to reencode them without surrogateescape, BOOM!
> These things are the text equivalent of IEEE NaNs. If all you know
> (as in the stdlib) is that you have "generic text", the only fairly
> safe things to do with them are (1) delete them, (2) substitute an
> appropriate replacement character for them, (3) pass the text
> containing them verbatim to other code, and (4) reencode them using
> the same codec they were read with.

Don't forget, these are errors. These are bytes that cannot be correctly decoded. That's not something that has any meaning whatsoever in text; so by definition, the only things you can do are the four you list there (as long as "codec" means both the choice of encoding and the use of the surrogateescape flag). It's like dealing with control characters when you need to print something visually, except that they have an official solution [1] and surrogateescape is unofficial. They're not real text, so you have to deal with them somehow.
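For readers who have not met surrogateescape before, here is a minimal sketch of the behaviour under discussion and of options (1), (2), and (4) from the list above. The sample bytes and the helper name `is_smuggled` are made up for illustration; only the `surrogateescape` error handler itself comes from the standard library.

```python
# Latin-1 encoded bytes that are not valid UTF-8 (illustrative sample data).
raw = b"caf\xe9"

# Decoding with surrogateescape smuggles the undecodable byte 0xE9
# through as the lone surrogate U+DCE9 instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")

# The "BOOM": re-encoding without surrogateescape is rejected.
strict_failed = False
try:
    text.encode("utf-8")
except UnicodeEncodeError:
    strict_failed = True


def is_smuggled(ch):
    """Hypothetical helper: True for code points surrogateescape produces."""
    return "\udc80" <= ch <= "\udcff"


# (1) Delete the smuggled bytes.
deleted = "".join(ch for ch in text if not is_smuggled(ch))

# (2) Substitute a replacement character for them.
replaced = "".join("\ufffd" if is_smuggled(ch) else ch for ch in text)

# (4) Re-encode with the same codec and the same error handler.
round_tripped = text.encode("utf-8", errors="surrogateescape")
```

Option (3) needs no code: the string is simply handed on unchanged, and whoever receives it takes on the same obligations.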

That isn't the case in the email package. The smuggled bytes are not errors[*]; they are literally smuggled bytes. But, as Stephen said, the only things email does with them are the last three of the four he listed (if you read (3) as passing it between parts of the email package): the data comes in as text mixed with binary, and the email package parses it until it knows what the binary is supposed to be, turns it back into bytes, and decodes it properly. The goal is to never let the smuggled bytes escape out of the email APIs as surrogateescape-encoded text; though, in practice, this being consenting-adults Python and code not being bug-free, there are places where people have used the knowledge of how surrogateescape is used by email to work around both API and code bugs.
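As a rough illustration of that flow, the sketch below feeds a message with an 8bit Latin-1 body through the public email API. The message content and addresses are invented for the example, and it shows only the documented entry points, not how the package handles the bytes internally.

```python
import email

# An invented message whose body contains the raw byte 0xE9 (Latin-1 "é").
raw_message = (
    b"From: sender@example.com\r\n"
    b"To: recipient@example.com\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xe9\r\n"
)

# The bytes are parsed as a mix of text and binary data.
msg = email.message_from_bytes(raw_message)

# Once the package knows what the binary is supposed to be, it hands
# it back as real bytes rather than as surrogate-escaped text.
payload_bytes = msg.get_payload(decode=True)

# The declared charset then decodes those bytes properly.
charset = msg.get_content_charset()
body = payload_bytes.decode(charset)

# No smuggled surrogates remain in what the caller sees.
has_surrogates = any("\udc80" <= ch <= "\udcff" for ch in body)
```

In this sketch the caller only ever sees bytes or correctly decoded text, which matches the stated goal of not letting the smuggled bytes leak out of the API.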

--David

[*] Some of the encoded bytes are errors (non-ASCII in headers or undecodable bytes in whatever the CTE/charset is), and in that case email may just turn them back into error bytes in the output, but only some of the smuggled bytes are actually errors (and none are if the message is RFC-compliant).


