[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Chris Angelico rosuav at gmail.com
Tue Sep 16 05:51:23 CEST 2014


On Tue, Sep 16, 2014 at 1:34 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:

Jim J. Jewett writes:

> In terms of best-effort, it is reasonable to treat the smuggled bytes
> as representing a character outside of your unicode repertoire

I have to disagree. If you ever end up passing them to something that validates or tries to reencode them without surrogateescape, BOOM! These things are the text equivalent of IEEE NaNs. If all you know (as in the stdlib) is that you have "generic text", the only fairly safe things to do with them are (1) delete them, (2) substitute an appropriate replacement character for them, (3) pass the text containing them verbatim to other code, and (4) reencode them using the same codec they were read with.
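For illustration, here is a minimal sketch of the safe options and of the "BOOM" case, assuming UTF-8 as the codec and a made-up input string. The helper names (is_escaped, delete_escapes, replace_escapes) are hypothetical, not stdlib functions; the only stdlib behaviour relied on is that the surrogateescape handler maps an undecodable byte 0xNN to the lone surrogate U+DCNN.

```python
def is_escaped(ch):
    # surrogateescape turns an undecodable byte 0xNN (0x80..0xFF) into U+DCNN
    return "\udc80" <= ch <= "\udcff"


def delete_escapes(text):
    # Option (1): drop the smuggled bytes entirely
    return "".join(ch for ch in text if not is_escaped(ch))


def replace_escapes(text, replacement="\ufffd"):
    # Option (2): substitute a replacement character (U+FFFD assumed here)
    return "".join(replacement if is_escaped(ch) else ch for ch in text)


# Made-up input: Latin-1 encoded "café ok", which is not valid UTF-8
raw = b"caf\xe9 ok"
text = raw.decode("utf-8", errors="surrogateescape")

# Option (4): same codec AND same error handler gives back the original bytes
roundtrip = text.encode("utf-8", errors="surrogateescape")

# Reencoding without surrogateescape is the failure case
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Option (3), passing the text along verbatim, needs no code at all, which is rather the point.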

Don't forget, these are errors. These are bytes that cannot be correctly decoded. That's not something that has any meaning whatsoever in text; so by definition, the only things you can do are the four you list there (as long as "codec" means both the choice of encoding and the use of the surrogateescape flag). It's like dealing with control characters when you need to print something visually, except that they have an official solution [1] and surrogateescape is unofficial. They're not real text, so you have to deal with them somehow.
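As a rough sketch of that "official solution" for control characters: the Control Pictures block in [1] provides a visible symbol for each C0 control, at U+2400 plus the control's code point, with DEL at U+2421. The function name show_controls is made up for this example.

```python
def show_controls(text):
    """Replace C0 control characters and DEL with their Control Pictures."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x20:
            # U+0000..U+001F map to U+2400..U+241F
            out.append(chr(0x2400 + cp))
        elif cp == 0x7F:
            # DEL has its own symbol, U+2421
            out.append("\u2421")
        else:
            out.append(ch)
    return "".join(out)


visible = show_controls("a\tb\x00")
```

There is no equivalent block for surrogateescape'd bytes, so any substitution you pick for them is your own convention.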

The bytes might each represent one character. Several of them together might represent a single character. Or maybe they don't mean anything at all, and they're just part of a chunked data format... like I was finding in the .cwk files that I was reading this weekend (it's mostly MacRoman encoding, but the text is divided into chunks separated by \0\0 and two more bytes - turns out the bytes are chunk lengths, so they don't mean any sort of characters at all). You can't know.
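To make the chunked case concrete, here is a sketch of a reader for a hypothetical layout: each chunk is two zero bytes, a two-byte big-endian length, then that many bytes of MacRoman text. This layout is an assumption for illustration only, not the actual .cwk format; the byte order and what the length covers are guesses.

```python
import struct


def read_chunks(data):
    """Split data into MacRoman text chunks (hypothetical layout, see above)."""
    chunks = []
    pos = 0
    while pos < len(data):
        if data[pos:pos + 2] != b"\x00\x00":
            raise ValueError("expected chunk marker at offset %d" % pos)
        # The two bytes after the marker are a length, not characters
        (length,) = struct.unpack_from(">H", data, pos + 2)
        start = pos + 4
        end = start + length
        if end > len(data):
            raise ValueError("truncated chunk at offset %d" % pos)
        chunks.append(data[start:end].decode("mac_roman"))
        pos = end
    return chunks


# Made-up sample: two chunks, "abc" and "é!" (0x8E is é in MacRoman)
sample = b"\x00\x00\x00\x03abc" + b"\x00\x00\x00\x02\x8e!"
chunks = read_chunks(sample)

# Decoding the whole blob as text instead turns the length bytes into
# control characters mixed into the output
naive = sample.decode("mac_roman")
```

The naive decode succeeds without complaint, which is exactly why you can't tell from the bytes alone whether they were ever meant to be characters.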

ChrisA

[1] http://www.unicode.org/charts/PDF/U2400.pdf


