[Python-3000] Making more effective use of slice objects in Py3k

Guido van Rossum guido at python.org
Thu Aug 31 20:55:15 CEST 2006


On 8/31/06, Talin <talin at acm.org> wrote:

One way to handle this efficiently would be to only support the encodings which have a constant character size: ASCII, Latin-1, UCS-2 and UTF-32. In other words, if the content of your text is plain ASCII, use an 8-bit-per-character string; if the content is limited to the Unicode BMP (Basic Multilingual Plane), use UCS-2; and if you are using Unicode supplementary characters, use UTF-32.

(The difference between UCS-2 and UTF-16 is that UCS-2 is always 2 bytes per character, and doesn't support the supplementary characters above 0xffff, whereas UTF-16 characters can be either 2 or 4 bytes.)
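The size difference is easy to check with the codecs that ship with present-day Python 3. This is only a sketch; the helper name `utf16_code_units` is made up for illustration.

```python
def utf16_code_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for one character."""
    # utf-16-le emits no byte-order mark, so the length is exactly the payload.
    return len(ch.encode("utf-16-le")) // 2

# A BMP character fits in one code unit, so UCS-2 can represent it.
euro = "\u20ac"
# A supplementary character needs a surrogate pair, which UCS-2 cannot express.
emoji = "\U0001F600"

print(utf16_code_units(euro), utf16_code_units(emoji))
```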

I think we should also support UTF-16, since Java and .NET (and Win32?) appear to be using it effectively; making surrogate handling an application issue doesn't seem too big of a burden for many apps.

By avoiding UTF-8, UTF-16 and other variable-character-length formats, you can always ensure that character index operations are done in constant time. Index operations would simply require scaling the index by the character size, rather than having to scan through the string and count characters.
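A minimal sketch of the scaled-index lookup, assuming the string is held as a little-endian byte buffer with a known width of 1, 2 or 4 bytes per character. The function name `char_at` and the byte order are assumptions made for the example, not part of the proposal.

```python
def char_at(buf: bytes, index: int, width: int) -> str:
    """Return the character at `index` from a fixed-width buffer in O(1)."""
    if width not in (1, 2, 4):
        raise ValueError("width must be 1, 2 or 4 bytes per character")
    start = index * width  # scale the index by the character size
    if index < 0 or start + width > len(buf):
        raise IndexError("character index out of range")
    # No scanning: the character's bytes sit at a computable offset.
    return chr(int.from_bytes(buf[start:start + width], "little"))

latin1 = "héllo".encode("latin-1")          # 1 byte per character
ucs2 = "héllo".encode("utf-16-le")          # BMP-only text: 2 bytes per character
utf32 = "a\U0001F600b".encode("utf-32-le")  # 4 bytes per character

print(char_at(latin1, 1, 1), char_at(ucs2, 1, 2), char_at(utf32, 1, 4))
```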

The drawback of this method is that you may be forced to transform the entire string into a wider encoding if you add a single character that won't fit into the current encoding.
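The widening cost can be sketched as follows, again assuming little-endian fixed-width buffers. The helper names (`narrowest_width`, `encode_fixed`, `append_char`) are hypothetical, and the 8-bit tier is taken to cover Latin-1 rather than only ASCII.

```python
def narrowest_width(text: str) -> int:
    """Return the smallest fixed width (1, 2 or 4 bytes) that can hold `text`."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1  # ASCII / Latin-1
    if highest <= 0xFFFF:
        return 2  # UCS-2 (BMP only)
    return 4      # UTF-32

def encode_fixed(text: str, width: int) -> bytes:
    """Encode `text` with exactly `width` bytes per character."""
    return b"".join(ord(ch).to_bytes(width, "little") for ch in text)

def append_char(buf: bytes, width: int, ch: str):
    """Append one character, widening the whole buffer first if it does not fit."""
    needed = narrowest_width(ch)
    if needed > width:
        # This is the drawback: every existing character is rewritten, O(n).
        buf = b"".join(
            int.from_bytes(buf[i:i + width], "little").to_bytes(needed, "little")
            for i in range(0, len(buf), width)
        )
        width = needed
    return buf + ord(ch).to_bytes(width, "little"), width

buf, width = encode_fixed("abc", 1), 1
buf, width = append_char(buf, width, "\u20ac")  # forces 1 -> 2 bytes per character
print(width, len(buf))
```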

A way to handle UTF-8 strings and other variable-length encodings would be to maintain a small cache of index positions with the string object.
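One possible reading of such a cache, sketched under assumptions: record the byte offset of every `stride`-th character, then scan forward from the nearest recorded offset. The class name `IndexedUTF8`, the fixed stride, and the requirement that the input is valid UTF-8 are all choices made for this example; a cache of recently used positions would be another reasonable design.

```python
class IndexedUTF8:
    """UTF-8 bytes plus a table of byte offsets for every `stride`-th character."""

    def __init__(self, data: bytes, stride: int = 64):
        self.data = data
        self.stride = stride
        self.checkpoints = []
        count = 0
        for offset, byte in enumerate(data):
            if byte & 0xC0 != 0x80:  # not a continuation byte: a character starts here
                if count % stride == 0:
                    self.checkpoints.append(offset)
                count += 1
        self.length = count

    def __len__(self) -> int:
        return self.length

    def _next(self, offset: int) -> int:
        """Return the byte offset where the following character starts."""
        offset += 1
        while offset < len(self.data) and self.data[offset] & 0xC0 == 0x80:
            offset += 1
        return offset

    def __getitem__(self, index: int) -> str:
        if not 0 <= index < self.length:
            raise IndexError("character index out of range")
        # Jump to the nearest checkpoint, then skip at most stride - 1 characters.
        offset = self.checkpoints[index // self.stride]
        for _ in range(index % self.stride):
            offset = self._next(offset)
        return self.data[offset:self._next(offset)].decode("utf-8")

text = "a\u00e9\u20ac\U0001F600" * 50  # mixes 1-, 2-, 3- and 4-byte characters
s = IndexedUTF8(text.encode("utf-8"), stride=8)
print(len(s), s[137] == text[137])
```

Lookup cost is bounded by the stride rather than by the string length, at the price of one stored offset per `stride` characters.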

(Another option is to simply make all strings UTF-32 -- which is not that unreasonable, considering that text strings normally make up only a small fraction of a program's memory footprint. I am sure that there are applications that don't conform to this generalization, however.)

Here you are effectively voting against polymorphic strings. I believe Fredrik has good reasons to doubt this assertion.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


