[Python-Dev] Unicode charmap decoders slow
Walter Dörwald walter at livinglogic.de
Wed Oct 5 17:08:04 CEST 2005
Martin v. Löwis wrote:
> Tony Nelson wrote:
>>> For decoding it should be sufficient to use a unicode string of
>>> length 256. u"\ufffd" could be used for "maps to undefined". Or the
>>> string might be shorter and byte values greater than the length of
>>> the string are treated as "maps to undefined" too.
>>
>> With Unicode using more than 64K codepoints now, it might be more
>> forward looking to use a table of 256 32-bit values, with no need
>> for tricky values.
>
> You might be missing the point. \ufffd is REPLACEMENT CHARACTER,
> which would indicate that the byte with that index is really unused
> in that encoding.
OK, here's a patch that implements this enhancement to PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939
The mapping argument to PyUnicode_DecodeCharmap() can now be a unicode string, which is used directly as the decoding table: byte value i is decoded to the character at index i.
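
To make the lookup concrete, here's a minimal pure-Python sketch of the proposed semantics (not the C code from the patch; the function name and error handling are illustrative only):

def charmap_decode_sketch(data, table):
    # Decode the byte string `data` using the unicode string `table`.
    # Byte value i maps to table[i]; u"\ufffd" (or an index past the
    # end of the table) means "maps to undefined" and is an error here.
    result = []
    for pos in range(len(data)):
        key = ord(data[pos])
        if key >= len(table) or table[key] == u"\ufffd":
            raise UnicodeDecodeError("charmap", data, pos, pos + 1,
                                     "character maps to <undefined>")
        result.append(table[key])
    return u"".join(result)

# A 256-character identity table behaves like latin-1:
table = u"".join([unichr(i) for i in range(256)])
print repr(charmap_decode_sketch("abc", table))  # u'abc'

The real decoder does this loop in C, which is where the speedup shown below comes from.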
Speed looks like this:
python2.4 -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
1000 loops, best of 3: 538 usec per loop
python2.4 -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
100 loops, best of 3: 3.85 msec per loop
./python-cvs -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
1000 loops, best of 3: 539 usec per loop
./python-cvs -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
1000 loops, best of 3: 623 usec per loop
Creating the decoding map as a string should probably be done by gencodec.py directly; that way the first import of the codec would be faster too.
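
For illustration, such a table could be built from the existing decoding_map dict roughly like this (the helper name is made up, not actual gencodec.py code):

def build_decoding_table(decoding_map):
    # decoding_map: {byte value: unicode codepoint or None}, as the
    # generated codec modules define it today.
    chars = []
    for byte in range(256):
        codepoint = decoding_map.get(byte)
        if codepoint is None:
            codepoint = 0xFFFD  # "maps to undefined", per the proposal
        chars.append(unichr(codepoint))
    return u"".join(chars)

gencodec.py could then write the resulting string literal into the generated codec module, so no table has to be built at import time.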
Bye, Walter Dörwald