[Python-Dev] len(chr(i)) = 2?

Stephen J. Turnbull stephen at xemacs.org
Thu Nov 25 04:55:40 CET 2010


Greg Ewing writes:

 > On 24/11/10 22:03, Stephen J. Turnbull wrote:

 > > But if you actually need to remember positions, or regions, to jump to later or to communicate to other code that manipulates them, doing this stuff the straightforward way (just copying the whole iterator object to hang on to its state) becomes expensive.

 > If the internal representation of a text pointer (I won't call it an iterator because that means something else in Python) is a byte offset or something similar, it shouldn't take up any more space than a Python int, which is what you'd be using anyway if you represented text positions by grapheme indexes or whatever.

That's not necessarily true. E.g., in Emacs ("there you go again"), Lisp integers are not only immediate (saving one pointer), but the type is encoded in the lower bits, so that there is no need for a type pointer -- the representation is smaller than the opaque marker type. Altogether, that saves up to 8 of 12 bytes on a 32-bit platform, or 16 of 24 bytes on a 64-bit platform.
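To make the "type encoded in the lower bits" idea concrete, here is a minimal sketch of low-bit tagging. The tag width and tag value (`TAG_BITS`, `INT_TAG`) and the function names are made up for illustration; they are not the actual Emacs layout.

```python
# Hypothetical layout: the low 2 bits of a machine word hold a type tag,
# the remaining bits hold the integer value itself (no pointer, no header).
TAG_BITS = 2
TAG_MASK = (1 << TAG_BITS) - 1
INT_TAG = 0b01  # assumed tag value for "immediate integer"


def box_int(n: int) -> int:
    """Pack an integer and its type tag into a single word."""
    return (n << TAG_BITS) | INT_TAG


def is_int(word: int) -> bool:
    """Check the type by looking only at the low bits of the word."""
    return (word & TAG_MASK) == INT_TAG


def unbox_int(word: int) -> int:
    """Recover the integer by shifting the tag bits away."""
    return word >> TAG_BITS


word = box_int(1234)
value = unbox_int(word)
```

The whole value fits in the word that would otherwise hold a pointer to a separately allocated object, which is where the space saving comes from.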

In Python it's true that markers can use the same data structure as integers and simply provide different methods, and it's arguable that Python's design is better. But if you use bytes internally, then you have problems. Do you expose that byte value to the user? Can users (programmers using the language and end users) specify positions in terms of byte values? If so, what do you do if the user specifies a byte value that points into a multibyte character? What if the user wants to specify position by number of characters? Can you translate efficiently?
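A small sketch of the two problems raised above, assuming the text is held as UTF-8 bytes. The helper names `is_char_boundary` and `char_to_byte_offset` are hypothetical, and "character" here means code point, not grapheme.

```python
def is_char_boundary(data: bytes, offset: int) -> bool:
    """True if offset does not point into the middle of a UTF-8 sequence."""
    if offset == 0 or offset == len(data):
        return True
    if not 0 < offset < len(data):
        return False
    # UTF-8 continuation bytes always look like 10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


def char_to_byte_offset(data: bytes, char_index: int) -> int:
    """Translate a code point index to a byte offset by scanning from the start."""
    seen = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:  # a lead byte starts a new character
            if seen == char_index:
                return offset
            seen += 1
    if seen == char_index:  # position just past the last character
        return len(data)
    raise IndexError(char_index)


text = "naïve café"
data = text.encode("utf-8")
inside_i_diaeresis = is_char_boundary(data, 3)  # byte 3 is the 2nd byte of "ï"
offset_of_v = char_to_byte_offset(data, 3)      # "v" is character 3
```

Without some auxiliary index, the translation is a linear scan, so the cost grows with the distance from the start of the text.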

As I say elsewhere, it's possible that there really never is a need to efficiently specify an absolute position in a large text as a character (grapheme, whatever) count. But I think it would be hard to implement an efficient text-processing language, e.g., a Python module for full conformance in handling Unicode, on top of UTF-8. Any time you have an algorithm that requires efficient access to arbitrary text positions, you'll spend all your skull sweat fighting the representation. At least, that's been my experience with Emacsen.

 > So I don't really see what you're arguing for here. How do you think positions in unicode strings should be represented?

I think what users should see is character positions, and they should be able to specify them numerically as well as via an opaque marker object. I don't care whether that position is represented as bytes or characters internally, except that the experience of Emacsen is that representation as byte positions is both inefficient and fragile. The representation as character positions is more robust but slightly less efficient.
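As a rough illustration of that interface, here is a sketch in which positions are visible as character indexes and can also be held as marker objects that follow the text across edits. The `Buffer` and `Marker` classes are hypothetical, not an existing Python or Emacs API, and the storage is an ordinary `str`.

```python
import operator


class Marker:
    """Opaque position handle that can also be used as a character index."""

    def __init__(self, pos: int):
        self.pos = pos

    def __index__(self) -> int:
        return self.pos


class Buffer:
    """Toy text buffer; positions are character (code point) indexes."""

    def __init__(self, text: str = ""):
        self.text = text
        self.markers = []

    def mark(self, pos) -> Marker:
        pos = operator.index(pos)  # accepts an int or an existing Marker
        if not 0 <= pos <= len(self.text):
            raise IndexError(pos)
        marker = Marker(pos)
        self.markers.append(marker)
        return marker

    def insert(self, pos, s: str) -> None:
        pos = operator.index(pos)
        self.text = self.text[:pos] + s + self.text[pos:]
        # Markers at or after the insertion point move with the text.
        for marker in self.markers:
            if marker.pos >= pos:
                marker.pos += len(s)


buf = Buffer("héllo wörld")
m = buf.mark(6)          # numeric character position of "w"
buf.insert(0, "¡")       # edit before the marker
char_at_marker = buf.text[m]
```

Because the marker stores a character count, inserting a multibyte character ahead of it shifts it by one regardless of how many bytes that character occupies in any encoding.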


