[Python-Dev] surrogatepass - she's a witch, burn 'er! [was: Cleaning up ...]

Greg Ewing greg.ewing at canterbury.ac.nz
Sat Aug 30 01:37:18 CEST 2014


M.-A. Lemburg wrote:

> we needed a way to make sure that Python 3 also optionally supports working with lone surrogates in such UTF-8 streams (nowadays called CESU-8: http://en.wikipedia.org/wiki/CESU-8).

I don't think CESU-8 is the same thing. According to the wiki page, CESU-8 requires all code points above 0xffff to be split into surrogate pairs before encoding. It also doesn't say that lone surrogates are valid -- it doesn't mention lone surrogates at all, only pairs. Neither does the linked technical report.

The technical report also says that CESU-8 forbids any UTF-8 sequences of more than three bytes, so it's definitely not "UTF-8 plus lone surrogates".
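
To make the difference concrete, here is a rough sketch comparing the two. The helper cesu8_encode is hypothetical (Python ships no CESU-8 codec that I know of) and is only my reading of the technical report: split anything above 0xffff into a surrogate pair, then encode each half as a three-byte sequence. I've assumed lone surrogates in the input should be rejected, since the report doesn't define them.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical CESU-8 encoder, for illustration only."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split the supplementary code point into a surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            # Encode each surrogate as its own three-byte sequence.
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            # Strict UTF-8: a lone surrogate raises UnicodeEncodeError here
            # (an assumption; the report says nothing about lone surrogates).
            out += ch.encode("utf-8")
    return bytes(out)


astral = "\U0001F600"  # a code point above 0xffff

# surrogatepass leaves it as an ordinary four-byte UTF-8 sequence...
utf8_bytes = astral.encode("utf-8", "surrogatepass")

# ...whereas the CESU-8 sketch produces two three-byte sequences.
cesu8_bytes = cesu8_encode(astral)

# surrogatepass does let a lone surrogate through as three bytes.
lone_bytes = "\ud83d".encode("utf-8", "surrogatepass")
```

So with surrogatepass the four-byte form survives and lone surrogates are allowed, while in the sketch the output never contains a sequence longer than three bytes.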

-- Greg


