[Python-Dev] please consider changing --enable-unicode default to ucs4

Zooko O'Whielacronx zookog at gmail.com
Sun Sep 20 16:02:09 CEST 2009


Dear Pythonistas:

The UCS2-vs-UCS4 build setting causes serious problems. Users occasionally get binaries built for a compatible Linux and Python version but with a different UCS2-vs-UCS4 setting, and those users get mysterious memory corruption errors which are hard to diagnose. It is possible that these situations also open up security vulnerabilities. A couple of such instances are documented at http://bugs.python.org/setuptools/issue78, but you can find more by googling. I would like to get this problem fixed!

In order to help address this issue I sampled what UCS size is used by python executables in the wild. I instrumented a few buildslaves that are contributed by various people to the Tahoe-LAFS project to print out their platform, python version, and sys.maxunicode. The full results are appended below. maxunicode: 1114111 means that the python executable was configured with --enable-unicode=ucs4, and maxunicode: 65535 means that it was configured with --enable-unicode=ucs2 or just with --enable-unicode. The only incompatibilities that I found arise because some packagers have deliberately set the UCS4 configuration and other packagers have left the default setting.
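For readers who want to check their own interpreter, the following is a minimal sketch of the kind of report described above. It is not the actual buildslave instrumentation; the helper name unicode_build is hypothetical, and the only assumption about Python itself is the mapping stated above (1114111 for a ucs4 build, 65535 for a ucs2 build).

```python
import platform
import sys


def unicode_build(maxunicode=None):
    """Map a sys.maxunicode value to the build setting it implies.

    Hypothetical helper for illustration; only the two values
    discussed in this message are recognised.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    if maxunicode == 0x10FFFF:   # 1114111: --enable-unicode=ucs4
        return "ucs4"
    if maxunicode == 0xFFFF:     # 65535: --enable-unicode=ucs2 (or the default)
        return "ucs2"
    return "unknown"


# Print a one-line report in roughly the format of the appended results.
print("%s: python: %s, maxunicode: %d (%s)" % (
    platform.platform(),
    sys.version.replace("\n", " "),
    sys.maxunicode,
    unicode_build(),
))
```

Note that on Python 3.3 and later sys.maxunicode is always 1114111, so this check only distinguishes builds of older interpreters.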

In the three cases where someone configured python with UCS2, one is certainly an accident (a custom-built python executable on an Ubuntu server). The other two just use the default instead of specifically configuring ucs2. I suspect that those packagers don't know the difference, and that it was an accident that they built a Python incompatible with other distributions of their operating system.

In sum, while it would be good to add the unicode setting to the platform's ABI (as discussed in setuptools ticket #78), it would also be good to make the default value be UCS4 instead of UCS2. This would fix all three of the potential incompatibilities that I found (listed below), and once we have proper inclusion of the unicode setting in the ABI in order to prevent the memory corruption, defaulting to UCS4 would increase the likelihood that a binary built on one distribution would be usable on another.
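To illustrate what including the unicode setting in the ABI could look like, here is a rough sketch of a load-time check. It is not the mechanism proposed in the setuptools ticket; the names unicode_abi_tag and check_compatible and the "ucs2"/"ucs4" tag strings are all hypothetical, chosen only to show the idea of refusing a mismatched binary instead of letting it corrupt memory.

```python
import sys


def unicode_abi_tag(maxunicode=None):
    """Return a hypothetical ABI tag component for the unicode width."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # Anything above 0xFFFF can only come from a ucs4 build.
    return "ucs4" if maxunicode > 0xFFFF else "ucs2"


def check_compatible(binary_tag, interpreter_tag=None):
    """Refuse a binary whose unicode width differs from the interpreter's.

    binary_tag is assumed to have been recorded when the extension
    was built (for example in its filename or metadata).
    """
    if interpreter_tag is None:
        interpreter_tag = unicode_abi_tag()
    if binary_tag != interpreter_tag:
        raise ImportError(
            "extension built for %s but this python is %s"
            % (binary_tag, interpreter_tag)
        )
    return True
```

With a check like this, a ucs2 extension offered to a ucs4 interpreter would fail with a clear error at import time rather than loading and misbehaving later.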

I'm sure that someone can come up with a reason why UCS2 is better than UCS4, but I'm also sure that the benefits of compatibility outweigh any benefits of UCS2 encoding, and that the widespread use of UCS4 demonstrates that there is nothing fatally wrong with it, and that people who really value UCS2 encoding more than compatibility can choose that for themselves by explicitly setting UCS2.

Let me restate that I am not suggesting taking away anyone's options, only making the setting default to the compatible option for people who don't specify one. Hm, I guess that means that it should default to UCS2 on Windows and Mac and to UCS4 on Linux and Solaris.

Regards,

Zooko

Ubuntu 6.10 "edgy" i386: python: 2.4.4c1 (#2, Mar 7 2008, 03:03:38) [GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)], maxunicode: 1114111
Ubuntu 7.04 "feisty": python: 2.5.1 (r251:54863, Jul 31 2008, 22:53:39) [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)], maxunicode: 1114111
Ubuntu 7.10 "gutsy" i386: python: 2.5.1 (r251:54863, Jul 31 2008, 23:17:40) [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)], maxunicode: 1114111
Ubuntu 8.04 "hardy" amd64: python: 2.5.2 (r252:60911, Jul 22 2009, 15:33:10) [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)], maxunicode: 1114111
Ubuntu 8.04 "hardy" i386: custom python: 2.6 (r26:66714, Oct 2 2008, 13:40:28) [GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)], maxunicode: 65535
Ubuntu 8.04 "hardy" i386: python: 2.5.2 (r252:60911, Jul 22 2009, 15:35:03) [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)], maxunicode: 1114111
Ubuntu 9.04 "jaunty" amd64: custom python: 2.6.2 (release26-maint, Apr 19 2009, 01:58:18) [GCC 4.3.3], maxunicode: 1114111

Debian 4.0 "etch" i386: python: 2.4.4 (#2, Oct 22 2008, 19:52:44) [GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)], maxunicode: 1114111
Debian 5.0 "lenny" i386: python: 2.5.2 (r252:60911, Jan 4 2009, 17:40:26) [GCC 4.3.2], maxunicode: 1114111
Debian 5.0 "lenny" amd64: python: 2.5.2 (r252:60911, Jan 4 2009, 21:59:32) [GCC 4.3.2], maxunicode: 1114111
Debian 5.0 "lenny" armv5tel: python: 2.5.2 (r252:60911, Jan 5 2009, 02:00:00) [GCC 4.3.2], maxunicode: 1114111
Debian unstable "squeeze/sid" i386: python: 2.5.4 (r254:67916, Feb 17 2009, 20:16:45) [GCC 4.3.3], maxunicode: 1114111

Fedora 11 "leonidas" amd64: python: 2.6 (r26:66714, Jul 4 2009, 17:37:13) [GCC 4.4.0 20090506 (Red Hat 4.4.0-4)], maxunicode: 1114111

ArchLinux: python: 2.6.2 (r262:71600, Jul 20 2009, 02:23:30) [GCC 4.4.0 20090630 (prerelease)], maxunicode: 65535

NetBSD 4: python: 2.5.2 (r252:60911, Mar 20 2009, 14:00:07) [GCC 4.1.2 20060628 prerelease (NetBSD nb2 20060711)], maxunicode: 65535

OpenSolaris SunOS-5.11-i86pc-i386-32bit: python: 2.4.4 (#1, Mar 10 2009, 09:35:36) [C], maxunicode: 65535
Nexenta NCP1 SunOS-5.11-i86pc-i386-32bit: python: 2.4.3 (#2, May 3 2006, 19:12:42) [GCC 4.0.3 (GNU_OpenSolaris 4.0.3-1nexenta4)], maxunicode: 1114111

Mac OS 10.6 "snow leopard" i386: python: 2.6.1 (r261:67515, Jul 7 2009, 23:51:51) [GCC 4.2.1 (Apple Inc. build 5646)], maxunicode: 65535
Mac OS 10.5 "leopard" i386: python: 2.5.1 (r251:54863, Feb 6 2009, 19:02:12) [GCC 4.0.1 (Apple Inc. build 5465)], maxunicode: 65535
Mac OS 10.4 "tiger" custom python: 2.5.4 (release25-maint:72153M, Apr 30 2009, 12:28:20) [GCC 4.0.1 (Apple Computer, Inc. build 5367)], maxunicode: 65535

Cygwin CYGWIN_NT-5.1-1.5.25-0.156-4-2-i686-32bit-WindowsPE: python: 2.5.2 (r252:60911, Dec 2 2008, 09:26:14) [GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)], maxunicode: 65535

Windows: python: 2.6.2 (r262:71600, Apr 21 2009, 15:05:37) [MSC v.1500 32 bit (Intel)], maxunicode: 65535
Windows: python: 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)], maxunicode: 65535


