[Python-Dev] thoughts on the bytes/string discussion

Stephen J. Turnbull stephen at xemacs.org
Sat Jun 26 19:24:50 CEST 2010


Greg Ewing writes:

> Would there be any sanity in having an option to compile Python with
> UTF-8 as the internal string representation?

Losing Py_UNICODE as mentioned by Stefan Behnel (IIRC) is just the beginning of the pain.

If Emacs's experience is any guide, the cost in speed and complexity of a variable-width internal representation is high. There are a number of tricks you can use, but basically the natural implementation of most operations (such as indexing by character) becomes O(n). You can get around that with a position cache, of course, but that adds complexity and really cuts into the space saving (and worse, adds another chunk of memory that may or may not be paged in when you need it).
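
To make the trade-off concrete, here is a minimal sketch in Python, not anything taken from Emacs or CPython. The names utf8_char_offset and PositionCache and the stride parameter are hypothetical, chosen only for illustration. The first function shows the natural O(n) scan needed to find the n-th character in a UTF-8 buffer; the class shows a simple position cache that records the byte offset of every stride-th character, trading extra memory for a bounded scan.

```python
def utf8_char_offset(buf: bytes, index: int) -> int:
    """Byte offset of the index-th character, found by scanning from the start."""
    if index < 0:
        raise IndexError("negative index")
    count = 0
    for offset, byte in enumerate(buf):
        # Bytes of the form 0b10xxxxxx continue a character; all others start one.
        if byte & 0xC0 != 0x80:
            if count == index:
                return offset
            count += 1
    raise IndexError("character index out of range")


class PositionCache:
    """Hypothetical cache holding the byte offset of every stride-th character."""

    def __init__(self, buf: bytes, stride: int = 64):
        self.buf = buf
        self.stride = stride
        self.offsets = []  # the extra memory that eats into the space saving
        count = 0
        for offset, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:
                if count % stride == 0:
                    self.offsets.append(offset)
                count += 1
        self.length = count  # number of characters, not bytes

    def char_offset(self, index: int) -> int:
        if not 0 <= index < self.length:
            raise IndexError("character index out of range")
        # Jump to the nearest cached character at or before index ...
        offset = self.offsets[index // self.stride]
        remaining = index % self.stride
        # ... then scan forward over at most stride - 1 characters.
        while remaining:
            offset += 1
            if self.buf[offset] & 0xC0 != 0x80:
                remaining -= 1
        return offset
```

The cache is only valid while the buffer is unchanged. Any insertion or deletion would require updating every cached offset after the edit point, which is part of the added complexity.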

What we're considering is a system where buffers come in 1-, 2-, and 4-octet widechars, with automatic translation depending on content. But the buffer is the primary random-access structure in Emacsen, so optimizing it is probably worth our effort. I doubt it would be worth it for Python, but my intuitions here are not reliable.
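
A rough sketch of that idea, again in Python and purely illustrative: the class FixedWidthBuffer and the helper narrowest_width are hypothetical names, and the little-endian unit layout is an assumption, not a description of any Emacs design. Every character occupies the same number of octets, so indexing stays O(1), and the whole buffer is widened only when a character arrives that does not fit the current width.

```python
def narrowest_width(text: str) -> int:
    """Smallest of 1, 2 or 4 octets that can hold every code point in text."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4


class FixedWidthBuffer:
    """Hypothetical buffer storing each character in `width` little-endian octets."""

    def __init__(self, text: str = ""):
        self.width = narrowest_width(text)
        self.data = bytearray()
        for ch in text:
            self.data += ord(ch).to_bytes(self.width, "little")

    def __len__(self) -> int:
        return len(self.data) // self.width

    def __getitem__(self, index: int) -> str:
        if not 0 <= index < len(self):
            raise IndexError("character index out of range")
        start = index * self.width  # O(1): no scanning needed
        unit = self.data[start:start + self.width]
        return chr(int.from_bytes(unit, "little"))

    def append(self, ch: str) -> None:
        needed = narrowest_width(ch)
        if needed > self.width:
            self._widen(needed)  # automatic translation depending on content
        self.data += ord(ch).to_bytes(self.width, "little")

    def _widen(self, new_width: int) -> None:
        padding = bytes(new_width - self.width)
        widened = bytearray()
        for start in range(0, len(self.data), self.width):
            # Little-endian, so zero octets are appended at the high end.
            widened += self.data[start:start + self.width] + padding
        self.data = widened
        self.width = new_width
```

The cost shows up in _widen: a single wide character forces an O(n) copy of the whole buffer and multiplies its memory use, which is the kind of trade-off that has to be weighed against how often random access is needed.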
