[Python-Dev] (Not) delaying the 3.2 release

Martin (gzlist) gzlist at googlemail.com
Thu Sep 16 21:43:25 CEST 2010


On 16/09/2010, Guido van Rossum <guido at python.org> wrote:
> On Thu, Sep 16, 2010 at 11:16 AM, Toshio Kuratomi <a.badger at gmail.com> wrote:
>> You were talking about encodings that were supersets of 7-bit ASCII. I think Martin was demonstrating a byte string that was a superset of 7-bit ASCII being fed to a stdlib function which went wrong.
>
> Whoops, sorry. I don't have access to Windows so I can't reproduce this though. I also don't understand it. What is the Unicode codepoint for that 十 character? What is sys.getfilesystemencoding()? What is the value of "C:\十".encode(sys.getfilesystemencoding())?

My fault, I should have been clearer. I was trying to demonstrate that there's a difference between the Unix-friendly encodings like UTF-8 and the EUC codecs, which only use high-bit bytes for non-ASCII text, and the ISO-2022 codecs and Shift JIS, which can use bytes in the ASCII range for it.
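A rough way to see that difference, as a sketch only: the snippet below is Python 3 (the thread predates it being the norm), and the helper name has_ascii_range_bytes is made up for illustration. It checks whether the encoded form of 十 (U+5341) contains any byte below 0x80 in a few of Python's bundled codecs.

```python
# Illustrative sketch (Python 3). The helper name is hypothetical,
# not something from the stdlib or from this thread.

def has_ascii_range_bytes(char, encoding):
    """Return True if any byte of the encoded character is below 0x80."""
    return any(byte < 0x80 for byte in char.encode(encoding))

JU = "\u5341"  # the character 十

# Encodings where every byte of a non-ASCII character has the high bit set.
high_bit_only = [
    enc for enc in ("utf-8", "euc_jp")
    if not has_ascii_range_bytes(JU, enc)
]

# Encodings where a non-ASCII character can contain ASCII-range bytes.
reuses_ascii = [
    enc for enc in ("cp932", "iso2022_jp")
    if has_ascii_range_bytes(JU, enc)
]

print(high_bit_only)
print(reuses_ascii)
```

Code that scans bytes for an ASCII delimiter is safe with the first group and can misfire with the second.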

In the example I gave, 十 encodes in CP932 as '\x8f\\', and the function gets confused by the second byte, which has the same value (0x5C) as an ASCII backslash. Obviously the right answer there is just to use unicode, rather than write a function that works with weird multibyte codecs.
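To make the failure concrete, here is a minimal Python 3 sketch. It does not reproduce the stdlib function from the earlier message; a naive byte-level split on the backslash stands in for any code that treats 0x5C as a path separator, and the variable names are only illustrative.

```python
# Illustrative sketch (Python 3): a naive byte-level split stands in for
# path-handling code that looks for the backslash byte 0x5C.

path = "C:\\\u5341"                  # the text C:\十

encoded_path = path.encode("cp932")  # 十 becomes the two bytes 0x8F 0x5C

# Splitting the bytes on backslash cuts 十 in half: its trail byte is
# mistaken for a separator.
byte_parts = encoded_path.split(b"\\")

# Splitting the text works on whole characters, so 十 stays intact.
text_parts = path.split("\\")

print(byte_parts)
print(text_parts)
```

The byte-level split leaves a lone lead byte and an empty trailing component, while the text-level split keeps 十 as a single path component, which is the argument for using unicode here.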

Martin


