[Python-Dev] Unicode charmap decoders slow

Tony Nelson tonynelson at georgeanelson.com
Wed Oct 5 04:52:22 CEST 2005


[Recipient list not trimmed, as my replies must be vetted by a moderator, which seems to delay them. :]

At 11:48 PM +0200 10/4/05, Walter Dörwald wrote:

On 04.10.2005 at 21:50, Martin v. Löwis wrote:

Walter Dörwald wrote:

For charmap decoding we might be able to use an array (e.g. a tuple (or an array.array?) of codepoints) instead of a dictionary.

This array would have to be sparse, of course. For encoding yes, for decoding no. Using an array.array would be more efficient, I guess - but we would need a C API for arrays (to validate the type code, and to get ob_item). For decoding it should be sufficient to use a unicode string of length 256. u"\ufffd" could be used for "maps to undefined". Or the string might be shorter, and byte values greater than the length of the string are treated as "maps to undefined" too.

With Unicode using more than 64K codepoints now, it might be more forward-looking to use a table of 256 32-bit values, with no need for tricky values. There is no need to add any C code to the codecs; just add some more code to the existing C function (which, if I have it right, is PyUnicode_DecodeCharmap() in unicodeobject.c).
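To make that concrete, here is a minimal standalone sketch of such a fixed-size decode table in plain C (not the actual PyUnicode_DecodeCharmap() code; the names and the Latin-1 fill are only for illustration). A byte decodes with a single array index, and 0xFFFD (REPLACEMENT CHARACTER) plays the "maps to undefined" role mentioned above:

#include <stdint.h>
#include <stdio.h>

/* Sketch only: a 256-entry table of 32-bit code points, one per byte value.
 * 0xFFFD marks "maps to undefined". */
static uint32_t decode_table[256];

static void init_latin1_table(void)
{
    /* Hypothetical example mapping: identity, as in Latin-1. */
    for (int i = 0; i < 256; i++)
        decode_table[i] = (uint32_t)i;
}

/* Decode one byte: a single array lookup, no dictionary probe. */
static uint32_t decode_byte(unsigned char b)
{
    uint32_t cp = decode_table[b];
    /* The caller would invoke the error handler when cp == 0xFFFD
     * and the byte really is undefined in the charmap. */
    return cp;
}

int main(void)
{
    init_latin1_table();
    printf("U+%04X\n", (unsigned)decode_byte(0xE9));  /* U+00E9 under Latin-1 */
    return 0;
}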

...

For encoding, having a C trie might give considerable speedup. codecs could offer an API to convert the current dictionaries into lookup-efficient structures, and the conversion would be done when importing the codec.

For the trie, two levels (higher and lower byte) would probably be sufficient: I believe most encodings only use 2 "rows" (blocks of 256 code points), very few more than three. This might work, although nobody has complained about charmap encoding yet. Another option would be to generate a big switch statement in C and let the compiler decide about the best data structure.
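For what it's worth, here is a rough standalone sketch of the quoted two-level idea in plain C (invented names, not a codecs API): the high byte of the code point selects a 256-entry row, the low byte selects the output byte, and rows the encoding never touches stay NULL.

#include <stdint.h>
#include <stdio.h>

#define NO_MAPPING 0xFFFF

/* Sketch of a two-level encode trie for a charmap codec (hypothetical names).
 * Level 1: high byte of the code point picks a 256-entry row.
 * Level 2: low byte picks the output byte. Unused rows stay NULL. */
typedef struct {
    const uint16_t *rows[256];
} EncodeTrie;

static int trie_encode(const EncodeTrie *t, uint32_t cp, unsigned char *out)
{
    const uint16_t *row;
    uint16_t b;
    if (cp > 0xFFFF || (row = t->rows[cp >> 8]) == NULL)
        return 0;                       /* whole block unmapped */
    if ((b = row[cp & 0xFF]) == NO_MAPPING)
        return 0;                       /* hole inside a mapped block */
    *out = (unsigned char)b;
    return 1;
}

int main(void)
{
    /* Illustrative row 0x00 (U+0000..U+00FF) for a Latin-1-like codec. */
    static uint16_t row0[256];
    for (int i = 0; i < 256; i++)
        row0[i] = (uint16_t)i;
    EncodeTrie t = { .rows = { [0] = row0 } };

    unsigned char out;
    if (trie_encode(&t, 0x00E9, &out))
        printf("U+00E9 -> 0x%02X\n", out);
    return 0;
}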

I'm willing to complain. :) I might allow saving of my (53 MB) MBox file. (Not that editing received mail makes as much sense as searching it.)

Encoding can be made fast using a simple hash table with external chaining. There are at most 256 codepoints to encode, and they will normally be well distributed in their low 8 bits. Hash on the low 8 bits (just mask) and chain into an area with 256 entries. Modest storage and normally short chains, therefore fast encoding.
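A standalone sketch of that scheme in plain C, with invented names (a real version inside the codec would differ in details; this toy uses 0xFF as the empty/end-of-chain sentinel and so tops out at 255 entries):

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Sketch (not CPython code) of encoding via a small hash table with
 * external chaining, hashed on the low 8 bits of the code point.
 * A charmap maps at most 256 code points, so fixed-size arrays suffice. */
typedef struct {
    uint32_t cp;        /* Unicode code point         */
    uint8_t  byte;      /* encoded output byte        */
    uint8_t  next;      /* chain index, 0xFF = end    */
} EncodeEntry;

typedef struct {
    uint8_t     head[256];      /* slot -> first entry index, 0xFF = empty */
    EncodeEntry entry[256];
    uint8_t     count;
} EncodeHash;

static void eh_init(EncodeHash *h)
{
    memset(h->head, 0xFF, sizeof h->head);
    h->count = 0;
}

static void eh_add(EncodeHash *h, uint32_t cp, uint8_t byte)
{
    uint8_t i = h->count++;
    h->entry[i].cp = cp;
    h->entry[i].byte = byte;
    h->entry[i].next = h->head[cp & 0xFF];   /* push onto the chain */
    h->head[cp & 0xFF] = i;
}

static int eh_lookup(const EncodeHash *h, uint32_t cp, uint8_t *out)
{
    for (uint8_t i = h->head[cp & 0xFF]; i != 0xFF; i = h->entry[i].next) {
        if (h->entry[i].cp == cp) {
            *out = h->entry[i].byte;
            return 1;
        }
    }
    return 0;   /* not encodable in this charmap */
}

int main(void)
{
    EncodeHash h;
    uint8_t b;
    eh_init(&h);
    eh_add(&h, 0x2122, 0x99);   /* e.g. U+2122 TRADE MARK SIGN -> 0x99 in cp1252 */
    if (eh_lookup(&h, 0x2122, &b))
        printf("U+2122 -> 0x%02X\n", b);
    return 0;
}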

At 12:08 AM +0200 10/5/05, Martin v. Löwis wrote:

I would try to avoid generating C code at all costs. Maintaining the build processes will just be a nightmare.

I agree; also I don't think the generated codecs need to be changed at all. All the changes can be made to the existing C functions, by adding caching per a reply of mine that hasn't made it to the list yet. Well, OK, something needs to hook writes to the codec's dictionary, but I /think/ that only needs Python code. I say:

...I suggest instead just /caching/ the translation in C arrays stored with the codec object. The cache would be invalidated on any write to the codec's mapping dictionary, and rebuilt the next time anything was translated. This would maintain the present semantics, work with current codecs, and still provide the desired speed improvement.

Note that this caching is done by new code added to the existing C functions (which, if I have it right, are in unicodeobject.c). No architectural changes are made; no existing codecs need to be changed; everything will just work, and usually work faster, with very modest memory requirements: one 256-entry array of 32-bit Unicode values, plus a hash table with 256 1-byte slots and 256 chain entries (each holding a 4-byte Unicode value, a byte output value, a byte chain index, and probably 2 bytes of filler), for a hash table size of 2304 bytes per codec.
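For illustration, here is a sketch of how that per-codec cache might be laid out, with invented names and sizes matching the arithmetic above (again, not actual CPython source):

#include <stdint.h>
#include <stdio.h>

/* Sketch of the per-codec cache layout described above. */
typedef struct {
    uint32_t value;    /* 4-byte Unicode value                */
    uint8_t  byte;     /* output byte                         */
    uint8_t  next;     /* chain index                         */
    uint8_t  pad[2];   /* the "2 bytes of filler" in the text */
} EncodeChainEntry;    /* 8 bytes                             */

typedef struct {
    int              valid;        /* cleared by the (Python-level) hook on any
                                      write to the mapping dictionary; the C
                                      fast path rebuilds the cache when it is 0 */
    uint32_t         decode[256];  /* byte -> code point, for decoding         */
    uint8_t          slot[256];    /* low-8-bit hash -> first chain entry      */
    EncodeChainEntry chain[256];   /* external chains, for encoding            */
} CharmapCache;

int main(void)
{
    /* Matches the arithmetic above: 256*4 = 1024 decode bytes,
     * 256*1 + 256*8 = 2304 hash table bytes. */
    printf("decode array: %zu bytes\n", sizeof(uint32_t) * 256);
    printf("hash table:   %zu bytes\n",
           sizeof(uint8_t) * 256 + sizeof(EncodeChainEntry) * 256);
    return 0;
}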


TonyN.:' <mailto:tonynelson at georgeanelson.com> ' <http://www.georgeanelson.com/>
