[Python-Dev] Unicode charmap decoders slow
Walter Dörwald walter at livinglogic.de
Tue Oct 4 23:48:08 CEST 2005
On 04.10.2005, at 21:50, Martin v. Löwis wrote:
> Walter Dörwald wrote:
>> For charmap decoding we might be able to use an array (e.g. a tuple
>> (or an array.array?) of code points) instead of a dictionary.
>> This array would have to be sparse, of course.
> For encoding yes, for decoding no.
> Using an array.array would be more efficient, I guess - but we would
> need a C API for arrays (to validate the type code, and to get ob_item).
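To make the idea concrete, here is a minimal sketch of the flat-table
variant in pure Python (the identity table below is only a placeholder
for a real codec's mapping, and the function name is made up):

import array

# One code point per byte value, indexed directly instead of through a
# dictionary lookup.  0xFFFD stands for "maps to undefined".
decoding_table = array.array('H', range(256))

def decode_byte(byte):
    codepoint = decoding_table[ord(byte)]
    if codepoint == 0xFFFD:
        raise UnicodeDecodeError("charmap", byte, 0, 1,
                                 "character maps to <undefined>")
    return unichr(codepoint)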
For decoding it should be sufficient to use a unicode string of
length 256. u"\ufffd" could be used for "maps to undefined". Or the
string might be shorter, and byte values beyond the end of the string
would be treated as "maps to undefined" too.
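A minimal sketch of that variant, with a made-up table (a real one would
be generated from the codec's mapping file):

# Position i holds the character for byte value i; u"\ufffd" means
# "maps to undefined", and byte values beyond the end of a shorter
# table count as undefined as well.
decoding_table = u"".join(unichr(i) for i in range(0xA0))

def charmap_decode_byte(byte, table=decoding_table):
    value = ord(byte)
    if value >= len(table) or table[value] == u"\ufffd":
        return u"\ufffd"   # or raise UnicodeDecodeError, depending on errors
    return table[value]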
>> Or we could implement this array as a C array (i.e. gencodec.py would
>> generate C code).
> For decoding, we would not get any better than array.array, except for
> startup cost.
Yes.
> For encoding, having a C trie might give considerable speedup. codecs
> could offer an API to convert the current dictionaries into
> lookup-efficient structures, and the conversion would be done when
> importing the codec.
> For the trie, two levels (higher and lower byte) would probably be
> sufficient: I believe most encodings only use 2 "rows" (256 code point
> blocks), very few more than three.
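Roughly, building such a two-level structure from an encoding dictionary
(mapping code points to byte values) might look like the following;
names and error handling are only illustrative:

def build_encoding_trie(encoding_map):
    # The outer list is indexed by the high byte of the code point and
    # holds 256-entry rows indexed by the low byte; unused rows stay None.
    trie = [None] * 256
    for codepoint, byte in encoding_map.iteritems():
        hi, lo = codepoint >> 8, codepoint & 0xFF
        if trie[hi] is None:
            trie[hi] = [None] * 256
        trie[hi][lo] = byte
    return trie

def charmap_encode_char(char, trie):
    codepoint = ord(char)
    row = None
    if codepoint <= 0xFFFF:            # two levels cover the BMP only
        row = trie[codepoint >> 8]
    if row is None or row[codepoint & 0xFF] is None:
        raise UnicodeEncodeError("charmap", char, 0, 1,
                                 "character maps to <undefined>")
    return chr(row[codepoint & 0xFF])

The conversion could happen once when the codec module is imported, so
the per-character work is just two list indexings.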
This might work, although nobody has complained about charmap
encoding yet. Another option would be to generate a big switch
statement in C and let the compiler decide about the best data
structure.
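For illustration, a gencodec.py-style generator for such a switch might
look roughly like this (the function name and output format are
hypothetical, it assumes a plain {code point: byte value} mapping, and
real generated code would still have to hook into the codec error
callbacks):

def emit_encoding_switch(encoding_map, funcname="charmap_encode_byte"):
    # Emits a C function mapping a code point to a byte value, or -1
    # for "maps to undefined".
    lines = ["static int %s(Py_UNICODE ch)" % funcname,
             "{",
             "    switch (ch) {"]
    for codepoint in sorted(encoding_map):
        lines.append("    case 0x%04X: return 0x%02X;"
                     % (codepoint, encoding_map[codepoint]))
    lines.append("    default: return -1;")
    lines.append("    }")
    lines.append("}")
    return "\n".join(lines)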
Bye, Walter Dörwald