[Python-Dev] UCS2/UCS4 default

Daniel Arbuckle djarb at highenergymagic.org
Thu Jul 3 15:14:47 CEST 2008


On Thu, Jul 3, 2008 at 5:39 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> 1. If you are advocating disallowing the use of characters outside the BMP in a UCS-2 build, enumerate the advantages of doing so (paying particular attention to any advantages which cannot be obtained simply by using an appropriate codec that disallows non-BMP characters).

Right now, the same Python code has a different meaning depending on a compile-time option that most users didn't even set for themselves. Moreover, the errors caused by this semantic difference are not reported. There's just no way to justify that.
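To make the difference concrete, here is a rough sketch that simulates the two internal representations with codecs, so it runs on any interpreter regardless of how it was built. The helper name code_units is made up for this illustration; on an actual 2008-era build you would simply call len() on the unicode object and check sys.maxunicode to see which build you have.

```python
def code_units(text, width):
    """Count the storage units `text` needs at `width` bytes per unit.

    width=2 approximates what len() reports on a 2-byte build,
    width=4 what it reports on a 4-byte build. (Hypothetical helper.)
    """
    codec = {2: "utf-16-le", 4: "utf-32-le"}[width]
    return len(text.encode(codec)) // width


# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP), 'b'
s = "a\U0001D11Eb"

narrow_len = code_units(s, 2)  # 4: the clef occupies a surrogate pair
wide_len = code_units(s, 4)    # 3: one unit per character

print(narrow_len, wide_len)
```

Any indexing or slicing written against the 4-byte count lands in the wrong place under the 2-byte count, and nothing reports it.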

You can't solve this problem by saying 'programmers should choose a codec that limits them to the BMP when they target 2-byte Python,' because the problem arises specifically when code that works correctly in a 4-byte Python is run under a 2-byte Python, a move made by users rather than by programmers.

Since 2-byte Python is apparently a holdover for memory-limited (and presumably CPU-limited as well) systems, it doesn't make sense to impose on it the requirement of correctly dealing with surrogate pairs. Given that, it seems to me that the best solution would be to make 4-byte Python the default, and also to make 2-byte Python raise an exception when it encounters characters outside the BMP. This way, a mysterious and unreported semantic error becomes an explicit, reported one.
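As a sketch of the behaviour I have in mind for the 2-byte build, written in pure Python rather than in the interpreter itself: the name require_bmp and the choice of ValueError are assumptions for illustration only, not an existing API.

```python
BMP_MAX = 0xFFFF  # highest code point in the Basic Multilingual Plane


def require_bmp(text):
    """Return `text` unchanged, or raise if it holds a non-BMP character.

    Hypothetical stand-in for the check a 2-byte build could perform
    whenever a string is created.
    """
    for index, ch in enumerate(text):
        if ord(ch) > BMP_MAX:
            raise ValueError(
                "character U+%04X at index %d is outside the BMP"
                % (ord(ch), index)
            )
    return text


require_bmp("plain BMP text")        # passes through untouched

try:
    require_bmp("clef: \U0001D11E")  # rejected with an explicit error
except ValueError as exc:
    print(exc)
```

The point is only that the failure happens loudly at the moment the character appears, instead of showing up later as a wrong length or a split surrogate pair.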

For programmers who want to target a 2-byte format (for win32 compatibility, for example), the correct choice of codec is a superior solution to forcing a 2-byte internal representation on Python.
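A minimal sketch of what that looks like in practice, assuming the 2-byte format wanted at the boundary is little-endian UTF-16; the function names to_wire and from_wire are invented for this example.

```python
def to_wire(text):
    """Encode to a 2-byte format only at the boundary (assumed UTF-16-LE)."""
    return text.encode("utf-16-le")


def from_wire(data):
    """Decode boundary data back into the internal representation."""
    return data.decode("utf-16-le")


clef = "\U0001D11E"     # one character outside the BMP
wire = to_wire(clef)    # two 2-byte units (a surrogate pair) on the wire

print(len(wire) // 2)            # number of 2-byte units sent
print(from_wire(wire) == clef)   # the round trip preserves the character
```

The surrogate pair exists only in the encoded bytes; the program itself keeps working with whole characters.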


