[Python-Dev] len(chr(i)) = 2?

Stephen J. Turnbull stephen at xemacs.org
Wed Nov 24 10:03:29 CET 2010

James Y Knight writes:

> a) You seem to be hung up on implementation details of emacs.

Hung up? No. It's the program whose text model I know best, and even if its design could theoretically be a lot better for this purpose, I can't say I've seen a real program whose model is obviously better for the purpose of a language for implementing text editors.[1] So it's not obvious to me that its model can be ruled out on a priori grounds. If not, it would be nice if your new language could implement it efficiently without contorted programming.

> But yes, positions should be stored as a byte offset into the utf8 string. NOT as number of codepoints since the beginning of the string. Probably you want it to be somewhat opaque, so that you actually have to specify whether you wanted to go to +1 byte, codepoint, or grapheme.

Well, first of all, +1 byte should not be available to a text iterator, at least not with the same iterator/position object that implements character and/or grapheme movement. (You seem to have thought about this issue a lot, but mixing bytes with text units makes me wonder how much practical implementation you've done.)

Second, incrementing to grapheme boundaries is relatively easy to do efficiently, just as incrementing to a UTF-8 character boundary is easy to do. We already do the latter; the former is pragmatically harder, but not a conceptual stretch. That's not the question. The question is how do we identify an arbitrary position in the text? Sometimes it's nice to have a numerical measure of size or location.
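To make the "easy" half of that concrete, here is a minimal sketch of stepping to the next or previous UTF-8 character boundary. The function names are made up for illustration, and the code assumes the buffer holds well-formed UTF-8 and that `pos` already sits on a boundary.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def next_char_boundary(buf: bytes, pos: int) -> int:
    """Return the byte offset just past the character that starts at pos."""
    if pos >= len(buf):
        raise IndexError("already at end of buffer")
    pos += 1
    # Skip the continuation bytes that belong to the current character.
    while pos < len(buf) and is_continuation(buf[pos]):
        pos += 1
    return pos


def prev_char_boundary(buf: bytes, pos: int) -> int:
    """Return the byte offset where the character ending at pos starts."""
    if pos <= 0:
        raise IndexError("already at start of buffer")
    pos -= 1
    # Walk backwards over continuation bytes until a lead byte is found.
    while pos > 0 and is_continuation(buf[pos]):
        pos -= 1
    return pos
```

Each step touches at most four bytes, which is why incremental movement is cheap. Nothing here answers "where is character number N?" without scanning from a known position.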

It is not clear that position by grapheme count is going to be the natural way to determine position in a text. E.g., for languages with variable-metric characters, using character counts to line up table columns is going the way of Tyrannosaurus. In the Han-using languages, yes, column counts within lines are going to be important forever, because the characters are literally square for most practical purposes ... but they don't use composing characters (all the Japanese kana are precomposed, for example), so position by grapheme is going to be very close to position by character, and fine positioning will be done either by mouse or by incrementing the last few characters.

Nor do I think operations like "advance 1,000,000 characters" will have less meaning than "advance 1,000,000 graphemes." Both of them are just a way of saying "go way far away"; they end up in about the same place, and where there's a bias, it will be pretty consistent in a statistical sense for any given natural language (and therefore, for 99% of users).

> But once you [the language implementor] are providing correct abstractions for grapheme movement, it's just as easy to also provide an abstraction for codepoint movement, and make your low-level implementation of the iterator object be a byte-offset into a UTF8 buffer.

Sure, that's fine for something that just iterates over the text. But if you actually need to remember positions, or regions, to jump to later or to communicate to other code that manipulates them, doing this stuff the straightforward way (just copying the whole iterator object to hang on to its state) becomes expensive. You end up proliferating types that all do the same kind of thing. Judicious use of inheritance helps, but getting the fundamental abstraction right is hard. Or at least, Emacs hasn't found it in 20 years of trying.

OTOH, all that stuff "just works" and just works efficiently, up to the grapheme vs. character issue, with an array.
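A small illustration of the contrast, as a sketch only. With an array-like string a region is just a pair of integers that can be stored and handed around freely. With a UTF-8 buffer the same integers have to be translated into byte offsets by scanning. The helper `char_index_to_byte_offset` is a hypothetical name, and it assumes well-formed UTF-8.

```python
def char_index_to_byte_offset(buf: bytes, index: int) -> int:
    """Scan from the start, counting lead bytes, until `index` characters pass."""
    seen = 0
    for offset, byte in enumerate(buf):
        if (byte & 0xC0) != 0x80:  # any non-continuation byte starts a character
            if seen == index:
                return offset
            seen += 1
    if seen == index:
        return len(buf)  # position just past the last character
    raise IndexError("character index out of range")


text = "na\u00efve caf\u00e9"
region = (6, 10)                     # a region is just two integers
word = text[region[0]:region[1]]     # direct indexing into the array

buf = text.encode("utf-8")
start = char_index_to_byte_offset(buf, region[0])  # linear scan
end = char_index_to_byte_offset(buf, region[1])    # another linear scan
same_word = buf[start:end].decode("utf-8")
```

The scan can be shortened with caches or by starting from a nearby known position, but that is extra machinery the array model does not need.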

About that issue, to go back to tired old Emacs, all of the things I can think of that I might want to do by grapheme (display, insert, delete, move a few places) do fit the "increment until done" model. These things already work quite well for the variable-width buffer that "multilingual" Emacsen use, whether the old Mule encoding or UTF-8. So I can see how the UTF-8 model with appropriate iterators for characters and graphemes can work well for lots of applications and use cases.
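As a rough sketch of the "increment until done" model over a UTF-8 buffer, the cursor below keeps a byte offset internally and exposes only character and grapheme movement. `Utf8Cursor` and its methods are hypothetical names, not an existing API. The grapheme rule used here (a base character plus any following combining marks) is a deliberate simplification and not the full Unicode segmentation algorithm.

```python
import unicodedata


class Utf8Cursor:
    """Hypothetical cursor: an opaque byte offset into a UTF-8 buffer."""

    def __init__(self, buf: bytes, offset: int = 0):
        self._buf = buf
        self._offset = offset  # assumed to sit on a character boundary

    def at_end(self) -> bool:
        return self._offset >= len(self._buf)

    def _char_end(self, start: int) -> int:
        """Byte offset just past the character starting at `start`."""
        end = start + 1
        while end < len(self._buf) and (self._buf[end] & 0xC0) == 0x80:
            end += 1
        return end

    def next_codepoint(self) -> str:
        """Advance over one code point and return it."""
        if self.at_end():
            raise IndexError("cursor is at end of buffer")
        end = self._char_end(self._offset)
        ch = self._buf[self._offset:end].decode("utf-8")
        self._offset = end
        return ch

    def next_grapheme(self) -> str:
        """Advance over a base character plus any following combining marks.

        Simplified rule for illustration; real grapheme segmentation has
        more cases (Hangul jamo, emoji sequences, and so on).
        """
        cluster = self.next_codepoint()
        while not self.at_end():
            end = self._char_end(self._offset)
            ch = self._buf[self._offset:end].decode("utf-8")
            if unicodedata.combining(ch) == 0:
                break  # next code point starts a new cluster
            cluster += ch
            self._offset = end
        return cluster
```

Note that there is no "+1 byte" method: the byte offset is an implementation detail, and callers can only move by text units.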

But Emacs already has opaque "markers", yet nevertheless the use of integer character positions in strings and buffers has survived. That may have to do with mutability, and the "all the world is a buffer" design, as Glyph suggested, but I think it more likely that markers are very expensive to create and use compared to integers. Perhaps an editor of power similar to Emacs could be implemented with string operations on lines, or the like, and these issues would go away. But it's not obvious to me.
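One plausible way to see where the cost of markers comes from is the toy model below. It is an assumption-laden sketch, not Emacs's actual implementation: `Buffer` and `Marker` are hypothetical classes, and the buffer simply walks every live marker on each insertion.

```python
class Marker:
    """Hypothetical marker: a position that the buffer keeps up to date."""

    def __init__(self, pos: int):
        self.pos = pos


class Buffer:
    """Toy mutable text buffer that tracks its live markers."""

    def __init__(self, text: str = ""):
        self.text = text
        self.markers = []

    def make_marker(self, pos: int) -> Marker:
        marker = Marker(pos)
        self.markers.append(marker)  # the buffer must remember every marker
        return marker

    def insert(self, pos: int, s: str) -> None:
        self.text = self.text[:pos] + s + self.text[pos:]
        # Every insertion pays a cost proportional to the number of markers.
        for marker in self.markers:
            if marker.pos >= pos:
                marker.pos += len(s)


buf = Buffer("hello world")
marker = buf.make_marker(6)  # tracked position of "world"
raw = 6                      # plain integer position of "world"
buf.insert(0, ">> ")
tracked = buf.text[marker.pos:]
stale = buf.text[raw:]
```

After the insertion, `tracked` still names "world" while the plain integer points somewhere else. The integer costs nothing to create or copy but goes stale; the marker stays correct but taxes every edit.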

Footnotes: [1] Yes, I know that not all programs are text editors. So shoot me. It's still the text manipulation program I know best, and it's not obvious to me that it's the unique class that would need these features.
