[Python-Dev] surrogatepass - she's a witch, burn 'er!
Stephen J. Turnbull stephen at xemacs.org
Sat Aug 30 06:21:56 CEST 2014
Greg Ewing writes:
> M.-A. Lemburg wrote:
> > we needed a way to make sure that Python 3 also optionally supports working with lone surrogates in such UTF-8 streams (nowadays called CESU-8: http://en.wikipedia.org/wiki/CESU-8).
Besides what Greg says, CESU-8 is a UTF, and therefore encodes valid Unicode. Speaking imprecisely, CESU-8 is UTF-16 with variable-width code units (i.e., each 16-bit UTF-16 code unit is represented using the UTF-8 variable-width representation).[1]
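To make that description concrete, here is a minimal sketch in Python 3. The helper name cesu8_encode is hypothetical (the standard library ships no CESU-8 codec); it leans on the surrogatepass handler only to serialize each half of a surrogate pair with the UTF-8 bit layout.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode valid Unicode text as CESU-8."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP characters are encoded exactly as in UTF-8.
            # (A lone surrogate is not valid Unicode, so this raises.)
            out += ch.encode("utf-8")
        else:
            # Split the code point into its UTF-16 surrogate pair ...
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            # ... and write each 16-bit code unit using the UTF-8 layout.
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
    return bytes(out)


print(cesu8_encode("A\U0001F600"))    # b'A\xed\xa0\xbd\xed\xb8\x80'
print("A\U0001F600".encode("utf-8"))  # b'A\xf0\x9f\x98\x80'
```

The surrogates in the output always come in well-formed pairs, which is why CESU-8 by itself never requires lone surrogates.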
I think you are thinking of Markus Kuhn's UTF-8b (which I believe is exactly what is implemented by the surrogateescape handler).
As for the goal of "working with lone surrogates in such UTF-8 streams", the surrogateescape handler already permits that, and does so consistently across streams: it escapes the bytes of a UTF-8-encoded lone surrogate the same way it escapes any other invalid bytes, so lone surrogates from one stream cannot get mixed with garbage bytes decoded by surrogateescape in another stream. That mixture is what produces an unencodable mess.
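A short Python 3 sketch of the difference, using only the standard error handlers. The byte strings and the can_encode helper are illustrative assumptions, not taken from any real application.

```python
# Bytes a surrogatepass-style encoder emits for the lone surrogate U+D83D.
lone = b"\xed\xa0\xbd"

# surrogatepass decodes them back to the surrogate code point itself.
passed = lone.decode("utf-8", "surrogatepass")

# surrogateescape treats the same bytes as invalid UTF-8 and escapes each
# byte 0xNN as U+DCNN, so the original bytes survive a round trip.
escaped = lone.decode("utf-8", "surrogateescape")
round_trip = escaped.encode("utf-8", "surrogateescape")

# A genuinely undecodable byte from some other stream, escaped the same way.
garbage = b"\xff".decode("utf-8", "surrogateescape")


def can_encode(text: str, errors: str) -> bool:
    """Hypothetical helper: does text encode to UTF-8 with this handler?"""
    try:
        text.encode("utf-8", errors)
    except UnicodeEncodeError:
        return False
    return True


# Everything decoded with surrogateescape stays encodable with it.
print(can_encode(escaped + garbage, "surrogateescape"))  # True

# A surrogatepass-decoded surrogate mixed with escaped garbage is not.
print(can_encode(passed + garbage, "surrogateescape"))   # False

# surrogatepass does not raise on the mixture, but it no longer
# reproduces the original 0xFF byte.
print(garbage.encode("utf-8", "surrogatepass") == b"\xff")  # False
```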
I still don't see a justification for the surrogatepass handler. What applications are producing (not merely passing through) UTF-8-encoded surrogates these days?
Footnotes:
[1] For the curious, it's imprecise because in Unicode, code units are fixed-width by definition.