[Python-Dev] Unicode patches checked in

M.-A. Lemburg mal@lemburg.com
Wed, 15 Mar 2000 18:26:15 +0100


Christian Tismer wrote:

Fredrik Lundh wrote:

> > CT:
> > How do I build a dist that doesn't need to change a lot of
> > stuff in the user's installation?
>
> somewhere in this thread, Guido wrote:
>
> > BTW, I added a tag "pre-unicode" to the CVS tree to the revisions
> > before the Unicode changes were made.
>
> maybe you could base SLP on that one?

I have no idea how this works. Would this mean that I cannot get patches which come after the Unicode changes?

Meanwhile, I've looked into the sources. It is easy for me to get rid of the problem by supplying my own unicodedata.c, where I replace all functions by some unimplemented exception.

No need (see my other posting): simply disable the module altogether... this shouldn't hurt any part of the interpreter as the module is a user-land only module.

Furthermore, I wondered about the data format. Is the Unicode database used in your package as well? Otherwise, I see only references from unicodedata.c, and that means the data structure can be massively enhanced. At the moment, that baby is 64k entries long, with four bytes and an optional string per entry. This is a big waste. The strings are almost all some distinct prefixes together with a list of hex smallwords, stored as strings; this probably makes up 80 percent of the space.
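As an illustration of the string format being described: the decomposition field in UnicodeData.txt is either empty, a list of hex code points, or a "&lt;tag&gt;"-style prefix followed by such a list. A minimal sketch of splitting one field (the function name is mine):

```python
def parse_decomposition(field):
    """Split a UnicodeData.txt decomposition field into tag and code points.

    '<compat> 0020 0308' -> ('compat', [0x20, 0x308])
    '0041 0300'          -> (None, [0x41, 0x300])
    ''                   -> (None, [])
    """
    if not field:
        return None, []
    parts = field.split()
    tag = None
    if parts[0].startswith('<'):
        tag = parts[0][1:-1]       # strip the angle brackets
        parts = parts[1:]
    return tag, [int(p, 16) for p in parts]
```

The repeated prefixes and the hex-digit encoding are exactly the redundancy Christian points at below.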

I have made no attempt to optimize the structure (due to lack of time, mostly)... the current implementation is really not much different from a rewrite of the UnicodeData.txt file available at the unicode.org site.

If you want to, I can mail you the marshalled Python dict version of that database to play with.
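For reference, such a dict can be serialized and reloaded with the stdlib marshal module directly; a minimal round-trip sketch (the record layout shown here is illustrative, not the actual database format):

```python
import marshal

# Illustrative record: code point -> (category, decomposition string).
sample = {0x00C5: ('Lu', '0041 030A')}   # LATIN CAPITAL LETTER A WITH RING ABOVE

blob = marshal.dumps(sample)             # what would travel by mail/file
restored = marshal.loads(blob)
```

marshal is Python-version-specific and not safe for untrusted input, but for shipping a generated lookup table between developers it is the simplest tool at hand.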

The only function that uses the "decomposition" field (namely the string) is unicodedata.decomposition(). It does nothing more than wrap it into a PyObject. We can do a little better here. I guess I can bring it down to a third of this space without much effort, just by using:

- binary encoding for the tags as an enumeration
- binary encoding of the hexed entries
- omission of the spaces

Instead of 64k of structures which contain pointers anyway, I can use a 64k pointer array with offsets into one packed table.
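The scheme can be sketched in Python (the tag enumeration, record layout, and function names below are my own illustration of the idea, not the layout that was actually adopted):

```python
import struct

# Hypothetical tag enumeration; the real set would be generated from
# the tags that actually occur in UnicodeData.txt.
TAGS = ['', 'compat', 'font', 'noBreak', 'fraction']

def pack(decompositions):
    """Pack per-character (tag, codepoints) records into one table.

    decompositions: sequence of None or (tag, [codepoint, ...]).
    Returns (offsets, data).  Record layout: one byte tag index, one
    byte count, then a 16-bit big-endian value per code point (all
    assigned code points fit in 16 bits at this time).
    """
    data = bytearray(b'\x00\x00')        # offset 0 = the empty record
    offsets = []
    seen = {}                            # share identical records
    for rec in decompositions:
        if rec is None:
            offsets.append(0)
            continue
        tag, points = rec
        key = (tag, tuple(points))
        if key not in seen:
            seen[key] = len(data)
            data += bytes([TAGS.index(tag or ''), len(points)])
            for cp in points:
                data += struct.pack('>H', cp)
        offsets.append(seen[key])
    return offsets, bytes(data)

def decomposition(offsets, data, char_index):
    """Rebuild the original '<tag> XXXX XXXX' string on demand."""
    off = offsets[char_index]
    tag_index, count = data[off], data[off + 1]
    parts = ['<%s>' % TAGS[tag_index]] if tag_index else []
    for i in range(count):
        (cp,) = struct.unpack_from('>H', data, off + 2 + 2 * i)
        parts.append('%04X' % cp)
    return ' '.join(parts)
```

Characters with no decomposition all share offset 0, and identical records are stored once, so the packed table stays far smaller than 64k copies of the original strings; the access function just rebuilds the hex string on demand, as described below.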

The unicodedata access functions would change only slightly, just building some hex strings and so on. I guess this is not a time-critical section?

It may be if these functions are used in codecs, so you should pay attention to speed too...

Should I try this evening? :-)

Sure :-) go ahead...

-- Marc-Andre Lemburg


Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/