[Python-Dev] len(chr(i)) = 2?

Glyph Lefkowitz glyph at twistedmatrix.com
Fri Nov 26 08:51:35 CET 2010


On Nov 24, 2010, at 10:55 PM, Stephen J. Turnbull wrote:

Greg Ewing writes:

On 24/11/10 22:03, Stephen J. Turnbull wrote:

But if you actually need to remember positions, or regions, to jump to later or to communicate to other code that manipulates them, doing this stuff the straightforward way (just copying the whole iterator object to hang on to its state) becomes expensive.

If the internal representation of a text pointer (I won't call it an iterator because that means something else in Python) is a byte offset or something similar, it shouldn't take up any more space than a Python int, which is what you'd be using anyway if you represented text positions by grapheme indexes or whatever.

That's not necessarily true. Eg, in Emacs ("there you go again"), Lisp integers are not only immediate (saving one pointer), but the type is encoded in the lower bits, so that there is no need for a type pointer -- the representation is smaller than the opaque marker type. Altogether, up to 8 of 12 bytes saved on a 32-bit platform, or 16 of 24 bytes on a 64-bit platform.

Yes, yes, Lisp is very clever. Maybe some other runtime, like PyPy, could make this optimization. But I don't think that anyone is filling up main memory with gigantic piles of character indexes and needs to squeeze out that extra couple of bytes of memory on such a tiny object. Plus, this would allow such a user to stop copying the character data itself just to decode it, and on mostly-ASCII UTF-8 text (a common use case) this is a 2x savings right off the bat.
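To make the 2x figure concrete, here is a rough sketch that compares the size of the same mostly-ASCII text as UTF-8 bytes and as a two-bytes-per-code-unit form. UTF-16-LE stands in for the decoded representation here, which is an assumption; the sample string is made up for illustration.

```python
# Hypothetical sample text; any mostly-ASCII string behaves similarly.
sample = "The quick brown fox jumps over the lazy dog. " * 100 + "naïve café"

# Size if the text is kept as the UTF-8 bytes it arrived in.
utf8_size = len(sample.encode("utf-8"))

# Size if every code unit takes two bytes (UTF-16-LE, no byte order mark).
utf16_size = len(sample.encode("utf-16-le"))

ratio = utf16_size / utf8_size
print(utf8_size, utf16_size, round(ratio, 2))
```

The ratio approaches 2 as the share of ASCII characters grows, and shrinks for text dominated by characters that need three or more UTF-8 bytes.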

In Python it's true that markers can use the same data structure as integers and simply provide different methods, and it's arguable that Python's design is better. But if you use bytes internally, then you have problems.

No, you just have design questions.

Do you expose that byte value to the user?

Yes, but only if they ask for it. It's useful for computing things like quota and the like.

Can users (programmers using the language and end users) specify positions in terms of byte values?

Sure, why not?

If so, what do you do if the user specifies a byte value that points into a multibyte character?

Go to the beginning of the multibyte character. Report that position; if the user then asks the requested marker object for its position, it will report that byte offset, not the originally-requested one. (Obviously, do the same thing for surrogate pair code points.)
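A minimal sketch of that snapping rule for UTF-8, assuming the buffer is valid UTF-8. The function name `snap_to_char_start` is made up for illustration; it relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```python
def snap_to_char_start(data: bytes, offset: int) -> int:
    """Return the largest index <= offset that begins a UTF-8 sequence.

    Assumes `data` is valid UTF-8. An offset equal to len(data) (the end
    of the buffer) is returned unchanged.
    """
    if not 0 <= offset <= len(data):
        raise IndexError("offset out of range")
    # Back up over continuation bytes (10xxxxxx) until a lead byte is found.
    while 0 < offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset


# 'a' is 1 byte, 'é' is 2, '€' is 3, '😀' is 4, 'z' is 1.
text = "aé€😀z".encode("utf-8")
requested = 8                      # points into the middle of '😀'
actual = snap_to_char_start(text, requested)
print(requested, actual)           # the marker reports `actual`, not `requested`
```

Since a UTF-8 sequence is at most four bytes long, the loop backs up at most three bytes on valid input, so the adjustment is constant-time.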

What if the user wants to specify position by number of characters?

Part of the point that we are trying to make here is that nobody really cares about that use-case. In order to know anything useful about a position in a text, you have to have traversed to that location in the text. You can remember interesting things like the offsets of starts of lines, or the x/y positions of characters.

Can you translate efficiently?

No, because there's no point :). But you could implement an overlay that cached things like the beginning of lines, or the x/y positions of interesting characters.
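One way such an overlay might look, as a sketch only: it caches the byte offsets at which lines start and uses binary search to map a byte offset to a line. The class name `LineIndex` and its methods are hypothetical, and the buffer is assumed to be UTF-8 with "\n" line endings.

```python
import bisect


class LineIndex:
    """Hypothetical overlay caching the byte offset where each line starts."""

    def __init__(self, data: bytes):
        self.data = data
        self.starts = [0]
        # One pass over the raw bytes; nothing is decoded here.
        pos = data.find(b"\n")
        while pos != -1:
            self.starts.append(pos + 1)
            pos = data.find(b"\n", pos + 1)

    def line_of(self, byte_offset: int) -> int:
        """Return the 0-based number of the line containing byte_offset."""
        return bisect.bisect_right(self.starts, byte_offset) - 1

    def line_text(self, line: int) -> str:
        """Decode only the requested line, including its newline if present."""
        start = self.starts[line]
        if line + 1 < len(self.starts):
            end = self.starts[line + 1]
        else:
            end = len(self.data)
        return self.data[start:end].decode("utf-8")


index = LineIndex("héllo\nwörld\nend".encode("utf-8"))
print(index.starts, index.line_of(9), index.line_text(1))
```

Building the cache costs one traversal of the text, which fits the point above that you have to traverse to a location before you know anything useful about it; afterwards, lookups are logarithmic in the number of lines.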

As I say elsewhere, it's possible that there really never is a need to efficiently specify an absolute position in a large text as a character (grapheme, whatever) count.

But I think it would be hard to implement an efficient text-processing language, eg, a Python module for full conformance in handling Unicode, on top of UTF-8.

Still: why? I guess if I have some free time I'll try my hand at it, and maybe I'll run into a wall and realize you're right :).

Any time you have an algorithm that requires efficient access to arbitrary text positions, you'll spend all your skull sweat fighting the representation. At least, that's been my experience with Emacsen.

What sort of algorithm would that be, though? The main thing that I could think of is a text editor trying to efficiently allow the user to scroll to the middle of a large file without reading the whole thing into memory. But, in that case, you could use byte positions to estimate, and display a heuristic number while calculating the real line numbers. (This is what 'less' does, and it seems to work well.)
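The estimation idea can be sketched roughly as follows. This is not taken from the source of 'less'; the function `estimate_line` and the sampling strategy are assumptions made for illustration.

```python
def estimate_line(byte_offset: int, sample: bytes) -> int:
    """Estimate the 1-based line number at byte_offset.

    Uses the mean line length (in bytes) observed in `sample`, a small
    chunk of the file that has already been read.
    """
    newlines = sample.count(b"\n")
    if newlines == 0:
        return 1  # no information yet; fall back to the first line
    avg_line_bytes = len(sample) / newlines
    return int(byte_offset / avg_line_bytes) + 1


# Hypothetical file contents: 1000 lines of 10 bytes each.
data = b"abcdefghi\n" * 1000
sample = data[:100]                       # only the first 100 bytes are read
target = 5000                             # the user jumps to the middle

guess = estimate_line(target, sample)
exact = data[:target].count(b"\n") + 1    # the slow, exact answer
print(guess, exact)
```

On real files with uneven line lengths the guess will be off, which is why the estimate would only be shown until the exact count has been computed.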

So I don't really see what you're arguing for here. How do you think positions in unicode strings should be represented?

I think what users should see is character positions, and they should be able to specify them numerically as well as via an opaque marker object. I don't care whether that position is represented as bytes or characters internally, except that the experience of Emacsen is that representation as byte positions is both inefficient and fragile. The representation as character positions is more robust but slightly more inefficient.

Is it really the representation as byte positions which is fragile (i.e. the internal implementation detail), or the exposure of that position to calling code, and the idiomatic usage of that number as an integer?

