[Python-Dev] Multilingual programming article on the Red Hat Developer blog
Stephen J. Turnbull stephen at xemacs.org
Wed Sep 17 08:32:00 CEST 2014
Steven D'Aprano writes:
[long example]
> Am I right so far?
> So the email package uses the surrogate-escape error handler and ends up with this Unicode string:
> 'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'
> which can be encoded back to the bytes we started with.
Yes.
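For readers who want to see the round trip, here is a minimal sketch. The header bytes are an assumption made for illustration (ASCII text wrapped in UTF-8 curly quotes), not the bytes from the elided example, and the email package's internals are not reproduced here; only the built-in `surrogateescape` error handler is used.

```python
# Hypothetical header bytes: ASCII text with UTF-8 curly quotes around it.
raw = b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# A strict ASCII decode would fail on the non-ASCII bytes. With
# surrogateescape, each undecodable byte 0xNN becomes code point U+DCNN.
text = raw.decode('ascii', errors='surrogateescape')

# Encoding with the same codec and handler restores the original bytes.
roundtrip = text.encode('ascii', errors='surrogateescape')

# ascii() is used because printing lone surrogates directly can fail.
print(ascii(text))
print(roundtrip == raw)
```

The escaped code points carry no character meaning of their own; they only record which byte was seen at that position.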
> Note that technically those three \u... code points are NOT classified as "noncharacters".
Very unpythonic terminology, easily confusing the nonspecialist. Or the specialist -- I used to know that Unicode gave "noncharacter" a technical definition but it seems I forgot. But then, Unicode isn't a PSF product, so I guess it's OK to be unpythonic.
> They are actually surrogate code points:
> http://www.unicode.org/faq/private_use.html#nonchar4
> http://www.unicode.org/glossary/#surrogate_code_point
> and they're supposed to be reserved for UTF-16. I'm not sure of the implication of that.
It means that any Python program that invokes the surrogateescape handler is not a "conforming Unicode process", at least not on the naive interpretation of that definition. A conforming process would interpret them as corrupt characters and raise an exception as soon as they are detected.
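The distinction can be checked from Python itself. The sketch below assumes nothing beyond the standard library; `strict_utf8` is a hypothetical helper name introduced only for this illustration. It shows that the three code points have general category Cs (surrogate), that a genuine noncharacter such as U+FFFE does not, and that the strict UTF-8 encoder behaves like the "raise as soon as detected" process for surrogates only.

```python
import unicodedata

smuggled = '\udc9c\udc80\udce2'   # the three code points from the example
noncharacter = '\ufffe'           # a genuine Unicode noncharacter

# General categories: 'Cs' means surrogate, 'Cn' means unassigned
# (the category under which noncharacters are reported).
surrogate_categories = [unicodedata.category(ch) for ch in smuggled]
noncharacter_category = unicodedata.category(noncharacter)

def strict_utf8(s):
    """Hypothetical helper: encode strictly, returning None on failure."""
    try:
        return s.encode('utf-8')
    except UnicodeEncodeError:
        return None

print(surrogate_categories, noncharacter_category)
print(strict_utf8(smuggled), strict_utf8(noncharacter))
```

So the default (strict) codecs already refuse lone surrogates at the boundary; it is only the explicit surrogateescape handler that lets them through.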
A more sophisticated interpretation might argue that Python is multiple processes (in the sense of "process" used by Unicode), and that the Unicode standard only applies to characters. This is especially true of Pythons implementing PEP 393, since no surrogates should ever appear in text[1] at all. Then the smuggled bytes can be treated as noncharacters in practice although technically it's a violation of the Unicode standard to do so.
Footnotes: [1] Meaning, no fair using chr() to inject them into str!
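To illustrate the footnote, here is a small sketch of why injecting surrogates with chr() is "no fair". It uses only built-in behaviour; `escapes` is a hypothetical helper name. A surrogate created with chr() is indistinguishable from one produced by the handler, so it turns into a raw byte on encoding, while surrogates outside U+DC80..U+DCFF are rejected even by surrogateescape.

```python
# A surrogate injected by hand versus one smuggled in by the handler.
injected = chr(0xDC80)
smuggled = b'\x80'.decode('ascii', errors='surrogateescape')

same = injected == smuggled
as_bytes = injected.encode('ascii', errors='surrogateescape')

def escapes(code_point):
    """Hypothetical helper: can surrogateescape encode this code point?"""
    try:
        chr(code_point).encode('ascii', errors='surrogateescape')
        return True
    except UnicodeEncodeError:
        return False

print(same, as_bytes)
print(escapes(0xDCFF), escapes(0xD800))
```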