[Python-Dev] thoughts on the bytes/string discussion

Ronald Oussoren ronaldoussoren at mac.com
Tue Jul 6 16:51:53 CEST 2010


On 27 Jun, 2010, at 11:48, Greg Ewing wrote:

> Stefan Behnel wrote:
>> Greg Ewing, 26.06.2010 09:58:
>>> Would there be any sanity in having an option to compile Python with UTF-8 as the internal string representation?
>>
>> It would break Py_UNICODE, because the internal size of a unicode character would no longer be fixed.
>
> It's not fixed anyway with the 2-char build -- some characters are represented using a pair of surrogates.

For practical purposes it is not even fixed in 4-char builds. In a 4-char build every Unicode code point corresponds to one item in a Python unicode string, but a base character followed by combining characters is still a sequence of several items, and should IMHO almost always be treated as a single object. As an example, given s="be\N{COMBINING DIAERESIS}", s[:2] or s[2:] is almost certainly semantically invalid, because the slice separates the "e" from its diaeresis.
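To make the example concrete, here is a rough sketch in Python 3 syntax (on the 2.x builds discussed here the literal would need a u prefix). The helper name clusters is made up for illustration, and grouping by unicodedata.combining() is only a simplification: it is not the full grapheme cluster segmentation defined by Unicode, just enough to show why slicing by index can split a user-perceived character.

```python
import unicodedata

s = "be\N{COMBINING DIAERESIS}"

# Three code points: 'b', 'e', and U+0308 COMBINING DIAERESIS.
assert len(s) == 3

# Slicing at index 2 strands the combining mark without its base character.
head, tail = s[:2], s[2:]
assert head == "be"
assert unicodedata.combining(tail) != 0  # tail is a lone combining mark


def clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper, simplified: it relies only on the canonical
    combining class and ignores the other grapheme cluster rules.
    """
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch):
            # Non-zero combining class: attach to the preceding group.
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


# Two user-perceived characters, although len(s) is 3.
print(len(clusters(s)))
```

Slicing the list returned by clusters() instead of the string itself keeps each base character together with its marks, which is roughly the "single object" behaviour argued for above.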

Ronald
