[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Jim J. Jewett jimjjewett at gmail.com
Fri Sep 12 17:37:59 CEST 2014


On September 11, 2014, Jeff Allen wrote:

> ... the area of code point space used for the smuggling of bytes under PEP-383 is not a "Unicode Private Use Area", but a portion of the trailing surrogate range. This is a code violation, which I imagine is why "surrogateescape" is an error handler, not a codec.

True, but I believe that is a CPython implementation detail.

Other implementations (including Jython) should implement the surrogateescape API, but I don't think it is important to use the same internal representation for the invalid bytes.

(Well, unless you want to communicate with external tools (GUIs?) that try to use that particular internal encoding directly, effectively as bytes rather than strings, when communicating with Python.)
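
For illustration, here is a minimal sketch of what the surrogateescape API promises, assuming CPython 3. The helper name `roundtrip` is hypothetical. The round trip is the behavior callers rely on; the intermediate code point shown in the comment is what CPython produces.

```python
def roundtrip(raw):
    """Decode bytes with surrogateescape, then encode them back."""
    # Bytes that are not valid UTF-8 are smuggled into the str
    # instead of raising UnicodeDecodeError.
    text = raw.decode("utf-8", errors="surrogateescape")
    # Encoding with the same handler restores the original bytes.
    back = text.encode("utf-8", errors="surrogateescape")
    return text, back


raw = b"abc\xff"          # 0xFF can never appear in valid UTF-8
text, back = roundtrip(raw)

print(ascii(text))        # on CPython the byte 0xFF shows up as U+DCFF
print(back == raw)        # the original bytes survive the round trip
```

Code that only checks `back == raw` depends on the API alone; code that inspects `text` for U+DC80..U+DCFF depends on the representation as well.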

> lone surrogates preclude a naive use of the platform string library

Invalid input often causes problems. Are you saying that there are situations where the platform string library could easily handle invalid characters in general, but has a problem with the specific case of lone surrogates?
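
As a small illustration of invalid input causing problems, the sketch below (assuming CPython 3) shows a strict UTF-8 encode rejecting a string that carries an escaped byte. It says nothing about how any particular platform string library behaves; the helper name `strict_encode_ok` is hypothetical.

```python
def strict_encode_ok(text):
    """Return True if text can be encoded as UTF-8 with the default strict handler."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Lone surrogates are not valid in UTF-8, so strict encoding fails.
        return False


# A str holding one escaped byte, i.e. a lone surrogate.
smuggled = b"\xff".decode("utf-8", errors="surrogateescape")

print(strict_encode_ok("plain text"))
print(strict_encode_ok(smuggled))
```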

-jJ

--

If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ


