[Python-Dev] Unicode charmap decoders slow
"Martin v. Löwis" martin at v.loewis.de
Wed Oct 5 08:36:58 CEST 2005
Tony Nelson wrote:
>> For decoding it should be sufficient to use a unicode string of length 256. u"\ufffd" could be used for "maps to undefined". Or the string could be shorter, with byte values beyond its end also treated as "maps to undefined".
> With Unicode using more than 64K codepoints now, it might be more forward-looking to use a table of 256 32-bit values, with no need for tricky values.
You might be missing the point. \ufffd is REPLACEMENT CHARACTER, which would indicate that the byte with that index is really unused in that encoding.
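For concreteness, here is a minimal sketch of the string-table decode being discussed, written in modern Python with a made-up mapping (bytes 0x00-0x9F pass through, 0xA0-0xFF pretend-unused); a real codec would supply its own 256-entry table:

    # Hypothetical 256-entry decoding table: a str where "\ufffd"
    # (REPLACEMENT CHARACTER) marks byte values the encoding does not use.
    decoding_table = "".join(chr(i) for i in range(0xA0)) + "\ufffd" * 0x60

    def charmap_decode(data, table):
        out = []
        for byte in data:          # iterating over bytes yields ints 0..255
            if byte < len(table):
                out.append(table[byte])
            else:
                # Short tables: out-of-range byte values are undefined too.
                out.append("\ufffd")
        return "".join(out)

    print(charmap_decode(b"abc\xff", decoding_table))  # -> 'abc\ufffd'

The point of the string representation is that the per-byte step is a plain index into a flat array, with no per-character dictionary lookup at all.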
> Encoding can be made fast using a simple hash table with external chaining. There are at most 256 codepoints to encode, and they will normally be well distributed in their lower 8 bits. Hash on the low 8 bits (just mask), and chain into an area with 256 entries. Modest storage and normally short chains, therefore fast encoding.
This is what is currently done: a hash map with 256 keys. You are complaining about the performance of that algorithm. The issue of external chaining is likely irrelevant: there are likely no collisions at all, even though Python uses open addressing.
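As a rough model of the status quo Martin describes (the mapping and names here are illustrative, not taken from a real codec): charmap encoding is one dict lookup per character, so the cost lies in the per-character dict machinery, not in collision handling:

    # Hypothetical encoding map: code point -> byte value, identity for 0x00-0x9F.
    encoding_map = {i: i for i in range(0xA0)}

    def charmap_encode(text, table):
        out = bytearray()
        for pos, ch in enumerate(text):
            try:
                out.append(table[ord(ch)])   # one hash lookup per character
            except KeyError:
                raise UnicodeEncodeError("sketch-charmap", text, pos, pos + 1,
                                         "character maps to <undefined>")
        return bytes(out)

    print(charmap_encode("abc", encoding_map))  # -> b'abc'

With at most 256 small integer keys, a dict of this shape essentially never collides, which is why replacing open addressing with external chaining would not buy anything.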
> ...I suggest instead just /caching/ the translation in C arrays stored with the codec object. The cache would be invalidated on any write to the codec's mapping dictionary, and rebuilt the next time anything was translated. This would maintain the present semantics, work with current codecs, and still provide the desired speed improvement.
That is not implementable. You cannot catch writes to the dictionary.
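To make the objection concrete: a plain dict exposes no hook on mutation, so a C-array cache derived from it can silently go stale. A versioned mapping, sketched below with hypothetical names, could support invalidation, but existing codecs hand the machinery plain dicts, so deploying such a type would itself mean editing or regenerating the codecs:

    # A dict subclass *could* announce writes; a plain dict cannot.
    # (A complete version would also have to hook update(), setdefault(),
    # pop(), etc. -- illustrative only.)
    class WatchedDict(dict):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.version = 0            # bumped on every mutation

        def __setitem__(self, key, value):
            self.version += 1
            super().__setitem__(key, value)

        def __delitem__(self, key):
            self.version += 1
            super().__delitem__(key)

    mapping = WatchedDict({0x41: 0x41})
    cached_at = mapping.version         # cache remembers when it was built
    mapping[0x42] = 0x42                # a write the cache must notice...
    assert mapping.version != cached_at  # ...detectable only with this subclass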
> Note that this caching is done by new code added to the existing C functions (which, if I have it right, are in unicodeobject.c). No architectural changes are made; no existing codecs need to be changed; everything will just work.
Please try to implement it. You will find that you cannot. I don't see how regenerating/editing the codecs could be avoided.
Regards, Martin