[Python-Dev] Unicode exception indexing

Guido van Rossum guido at python.org
Thu Nov 3 22:09:37 CET 2011


On Thu, Nov 3, 2011 at 12:29 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:

On Thu, 03 Nov 2011 18:14:42 +0100 martin at v.loewis.de wrote:

There is a backwards compatibility issue with PEP 393 and Unicode exceptions: the start and end indices: are they Py_UNICODE indices, or code point indices?

On the one hand, these indices are used in formatting error messages such as "codec can't encode character \u%04x in position %d", suggesting they are regular indices into the string (counting code points).

On the other hand, they are used by error handlers to look up the character, and existing error handlers (including the ones we have now) use PyUnicode_AsUnicode to find the character. This suggests that the indices should be Py_UNICODE indices, for compatibility (and they currently do work in this way).

But what about error handlers written in Python?

The indices can only be different if the string is a UCS-4 string, and Py_UNICODE is a two-byte type (i.e. on Windows). So what should it be?

I'd say let's do the Right Thing and accept the small compatibility breach (surrogates on UCS-2 builds).

+1
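
To make the two candidate index spaces concrete, here is a minimal Python sketch. It assumes Python 3.3 or later, where strings are indexed by code point; the helper utf16_unit_index and the handler name "record_and_replace" are hypothetical and exist only for this illustration. It compares the code point index of a character with its offset in 16-bit units, then records which of the two a Python-level error handler actually receives in exc.start and exc.end.

```python
import codecs


def utf16_unit_index(text, cp_index):
    """Return the offset, in 16-bit units, of the code point at cp_index."""
    # Each code point above U+FFFF needs a surrogate pair (two 16-bit units).
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:cp_index])


# One non-BMP character, an ASCII letter, then a non-ASCII BMP character.
text = "\U0001F600x\u00e9"

# Position of the last character in each index space.
cp_index = text.index("\u00e9")
unit_index = utf16_unit_index(text, cp_index)

seen = []


def record_and_replace(exc):
    # Hypothetical handler: record the reported range, then substitute '?'.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    seen.append((exc.start, exc.end, exc.object[exc.start:exc.end]))
    # Return the replacement text and the position at which to resume.
    return ("?" * (exc.end - exc.start), exc.end)


codecs.register_error("record_and_replace", record_and_replace)
encoded = text.encode("ascii", errors="record_and_replace")

print("code point index:", cp_index)
print("16-bit unit index:", unit_index)
print("ranges seen by the handler:", seen)
print("encoded:", encoded)
```

Under these assumptions the two index spaces disagree only after a non-BMP character, and the handler sees the code point position, so exc.object[exc.start:exc.end] yields the offending character directly.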

--Guido van Rossum (python.org/~guido)


