[Python-Dev] Alternative Implementation for PEP 292: Simple String

Substitutions ([original](https://mail.python.org/pipermail/python-dev/2004-September/048770.html))

Stephen J. Turnbull stephen at xemacs.org
Fri Sep 10 07:38:38 CEST 2004


"Gareth" == Gareth McCaughan <gmccaughan at synaptics-uk.com> writes:

Gareth> That said, I strongly agree that all textual data should
Gareth> be Unicode as far as the developer is concerned; but, at
Gareth> least in the USA :-), it makes sense to have an optimized
Gareth> representation that saves space for ASCII-only text, just
Gareth> as we have an optimized representation for small integers.
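A minimal sketch of the small-integer analogy. In CPython, small integers are cached at interpreter startup (an implementation detail, roughly the range -5..256, not a language guarantee), so equal small values share a single object:

```python
# CPython caches small integers (an implementation detail, roughly
# -5..256), so two names bound to the same small value typically
# refer to one shared object -- an optimized representation for a
# common case, analogous to what Gareth suggests for ASCII text.
a = 100
b = 100
print(a is b)   # True in CPython: both names share the cached object

# Equality never depends on this caching; it is purely a space/time
# optimization invisible to value semantics.
print(a == b)   # True
```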

This is not at all obvious. As MAL just pointed out, if efficiency is a goal, text algorithms often need to be different for operations on texts that are dense in an 8-bit character space versus texts that are sparse in a 16-bit or 20-bit character space. Note that this is the same point being made there; SRE and ElementTree are pointed to as examples.

Viewed that way, the subtext to the comment is "I don't want to separately maintain 8-bit versions of new text facilities to support my non-Unicode applications; I want to impose that burden on the authors of text-handling PEPs." That may very well be the best thing for Python; having done a lot of Unicode implementation for Python, he's in a good position to make such judgements. But the development costs MAL refers to are bigger than you are estimating, and they will continue for as long as that policy does.

While I'm very sympathetic to the view that there's more than one way to skin a cat, and that a good cat-handling design should account for that, and while conceding his expertise, I nonetheless don't think that Python really wants to maintain more than one text-processing system by default. Of course if you restrict yourself to the class of ASCII-only strings you can do better, and of course that is a huge class of strings. But that, as such, is important only to efficiency fanatics.

The question is, how often are people going to notice that when they have pure ASCII they get a 100% speedup, or that they actually can just suck that 3GB ASCII file into their 4GB memory, rather than buffering it as 3 (or 6) 2GB Unicode strings? Compare how often people are going to notice that a new facility "just works" for Japanese or Hindi. I just don't see the former being worth the extra effort, while the latter makes the "this or that" choice clear. If a single representation is enough, it had better be Unicode-based, and the others can be supported in libraries (which turn binary blobs into non-standard text objects with appropriate methods) as the need arises.
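For what it's worth, CPython later adopted exactly this kind of flexible representation (PEP 393, Python 3.3): pure ASCII/Latin-1 text is stored at one byte per character, wider text at two or four. A rough sketch of the space difference on a modern CPython (exact sizes are implementation details and will vary by version and platform):

```python
import sys

n = 1000
ascii_text = "a" * n        # fits in one byte per character
wide_text = "\u3042" * n    # hiragana 'a': needs two bytes per character

# On a PEP 393 CPython the ASCII string takes roughly n bytes of
# payload plus a small fixed header; the wide string takes roughly
# 2*n bytes of payload for the same character count.
print(sys.getsizeof(ascii_text))
print(sys.getsizeof(wide_text))
print(sys.getsizeof(wide_text) > sys.getsizeof(ascii_text))  # True
```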

--
Institute of Policy and Planning Sciences    http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                        Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.


