[Python-Dev] Divorcing str and unicode (no more implicit conversions).

M.-A. Lemburg mal at egenix.com
Tue Oct 25 10:38:14 CEST 2005


Neil Hodgson wrote:

> M.-A. Lemburg:
>
>> Unicode has the concept of combining code points, e.g. you can store an "é" (e with an accent) as "e" + "'". Now if you slice off the accent, you'll break the character that you encoded using combining code points.
>> ...
>> next<indextype>(u, index) -> integer
>>
>> Returns the Unicode object index for the start of the next <indextype> found after u[index] or -1 in case no next element of this type exists.
>
> Should entity breakage be further discouraged by returning a slice here rather than an object index?

You mean a slice that slices out the next <indextype>?

> Something like:
>
>     i = firstgrapheme(u)
>     x = 0
>     while x < width and u[i] != "\n":
>         x, = draw(u[i], (x, y))
>         i = nextgrapheme(u, i)

This sounds a lot like you'd want iterators for the various index types. Should be possible to implement on top of the proposed APIs, e.g. itergraphemes(u), itercodepoints(u), etc.
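As a rough illustration of how such iterators could sit on top of a next-style index function, here is a minimal Python 3 sketch. The names nextgrapheme, itergraphemes and itercodepoints come from this thread, but the bodies are assumptions made for illustration, not the proposed implementation. In particular, a "grapheme" is simplified here to one code point plus any directly following combining marks; real grapheme segmentation has more rules than that.

```python
import unicodedata


def nextgrapheme(u, index):
    """Return the index where the next grapheme after u[index] starts, or -1.

    Assumption of this sketch: a grapheme is one code point followed by
    any code points with a non-zero canonical combining class.
    """
    i = index + 1
    # Skip combining marks that still belong to the current grapheme.
    while i < len(u) and unicodedata.combining(u[i]) != 0:
        i += 1
    return i if i < len(u) else -1


def itergraphemes(u):
    """Yield each grapheme of u as a slice, built on nextgrapheme()."""
    start = 0
    while 0 <= start < len(u):
        end = nextgrapheme(u, start)
        # -1 means "no further grapheme", so the rest of u is the last one.
        yield u[start:] if end == -1 else u[start:end]
        start = end


def itercodepoints(u):
    """Yield each code point of u as an integer."""
    for ch in u:
        yield ord(ch)
```

Yielding slices rather than indexes is one possible design choice; it means callers never handle a position that falls inside a grapheme.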

Note that what most people refer to as a "character" is a grapheme in Unicode speak. Given that interpretation, "breaking" Unicode "characters" is something you can never avoid simply by using larger code units such as UCS4-compatible ones.
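A small Python 3 example of that point, as a sketch only. Python 3 strings are indexed by code point, which is comparable to UCS4 storage, and slicing can still split a grapheme. The variable names are made up for illustration.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: one grapheme, two code points.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

decomposed_length = len(decomposed)  # 2 code points
composed_length = len(composed)      # 1 code point

# Slicing by code point cuts the accent off the decomposed form.
sliced = decomposed[:1]              # just "e"
```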

Furthermore, you should also note that surrogates (two code units encoding one code point) are part of Unicode life. While you don't need them when storing Unicode in UCS4 code units, they can still be part of the Unicode data and the programmer has to be aware of these.
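To make the surrogate case concrete, here is a hedged Python 3 sketch. The helper utf16_code_units is a hypothetical name introduced for this example. It shows one code point outside the BMP turning into two UTF-16 code units, and a lone surrogate sitting in string data even though the string is indexed by code point.

```python
import struct


def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    # Each code unit is one big-endian unsigned 16-bit value.
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


clef = "\U0001D11E"              # MUSICAL SYMBOL G CLEF, one code point
units = utf16_code_units(clef)   # high and low surrogate

# A lone surrogate can still be present in string data.
lone = "\ud834"
try:
    lone.encode("utf-8")
    lone_is_encodable = True
except UnicodeEncodeError:
    # The strict UTF-8 codec refuses unpaired surrogates.
    lone_is_encodable = False
```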

Personally, I don't think that slicing Unicode is such a big issue. If you know what you are doing, things tend not to break, which is true for pretty much everything you do in programming ;-)

-- Marc-Andre Lemburg eGenix.com



