[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Stephen J. Turnbull stephen at xemacs.org
Tue Sep 16 05:34:36 CEST 2014


Jim J. Jewett writes:

In terms of best-effort, it is reasonable to treat the smuggled bytes as representing a character outside of your unicode repertoire

I have to disagree. If you ever end up passing them to something that validates or tries to reencode them without surrogateescape, BOOM! These things are the text equivalent of IEEE NaNs. If all you know (as in the stdlib) is that you have "generic text", the only fairly safe things to do with them are (1) delete them, (2) substitute an appropriate replacement character for them, (3) pass the text containing them verbatim to other code, and (4) reencode them using the same codec they were read with.
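To make the BOOM and the four fairly safe options concrete, here is a minimal sketch. The sample bytes and the helper name is_smuggled are made up for illustration; the only assumption about the codec is the documented behaviour that surrogateescape maps undecodable bytes 0x80-0xFF to the lone surrogates U+DC80-U+DCFF.

```python
# Hypothetical input: a Latin-1 encoded name read as if it were ASCII.
raw = b"na\xefve.txt"
text = raw.decode("ascii", errors="surrogateescape")  # 'na\udcefve.txt'

# BOOM: a strict encoder refuses the lone surrogate.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    boom = exc
else:
    boom = None


def is_smuggled(ch):
    """True for the lone surrogates that surrogateescape produces."""
    return "\udc80" <= ch <= "\udcff"


# (1) delete them
deleted = "".join(ch for ch in text if not is_smuggled(ch))
# (2) substitute a replacement character (U+FFFD here)
replaced = "".join("\ufffd" if is_smuggled(ch) else ch for ch in text)
# (3) pass the text along verbatim
verbatim = text
# (4) reencode with the same codec and error handler it was read with
roundtrip = text.encode("ascii", errors="surrogateescape")
```

Only option (4) gets the original bytes back; (1) and (2) lose information but yield text that any validator will accept.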

-- so it won't ever match entirely valid strings, except perhaps via a wildcard. And it should still work for .endswith().

Incorrect, I'm pretty sure, unless you know that both texts containing them were read with the same codec. E.g., consider two filenames encoded in ISO Cyrillic and ISO Hebrew, read with (encoding='ascii', errors='surrogateescape'). The smuggled bytes can compare equal even though they stand for different characters.
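A rough sketch of that filename case, assuming ISO Cyrillic and ISO Hebrew mean ISO 8859-5 and ISO 8859-8 (Python codecs iso8859_5 and iso8859_8). The filenames are invented; all that matters is that both end in the same three high bytes.

```python
# Hypothetical filenames whose last three bytes happen to coincide.
tail = b"\xe0\xe1\xe2"
cyrillic_bytes = b"report-" + tail  # meant as ISO 8859-5
hebrew_bytes = b"photo-" + tail     # meant as ISO 8859-8

# Decoded with the right codecs, the tails are different characters.
cyrillic_real = cyrillic_bytes.decode("iso8859_5")
hebrew_real = hebrew_bytes.decode("iso8859_8")
real_tails_equal = cyrillic_real[-3:] == hebrew_real[-3:]

# Read as "generic text", both tails become the same lone surrogates.
cyrillic_smuggled = cyrillic_bytes.decode("ascii", errors="surrogateescape")
hebrew_smuggled = hebrew_bytes.decode("ascii", errors="surrogateescape")
false_match = cyrillic_smuggled.endswith(hebrew_smuggled[-3:])
```

Here real_tails_equal is False but false_match is True: .endswith() "works" in the sense that it returns an answer, but the answer reflects the bytes, not the text.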

Apps that know the semantics of the text may DWIM/DTRT if they want to, but FWIW-IMHO-YMMV-and-any-other-4-letter-caveat-acronyms-that-may-apply, Python and the stdlib shouldn't try to guess.

Guessing may be unavoidable, of course.


