[Python-Dev] Unicode charmap decoders slow
Walter Dörwald walter at livinglogic.de
Wed Oct 5 17:08:04 CEST 2005
Martin v. Löwis wrote:
> Tony Nelson wrote:
>>> For decoding it should be sufficient to use a unicode string of
>>> length 256. u"\ufffd" could be used for "maps to undefined". Or the
>>> string might be shorter and byte values greater than the length of
>>> the string are treated as "maps to undefined" too.
>>
>> With Unicode using more than 64K codepoints now, it might be more
>> forward looking to use a table of 256 32-bit values, with no need
>> for tricky values.
>
> You might be missing the point. \ufffd is REPLACEMENT CHARACTER,
> which would indicate that the byte with that index is really unused
> in that encoding.
OK, here's a patch that implements this enhancement to PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939
The mapping argument to PyUnicode_DecodeCharmap() can now be a unicode string, which is used directly as the decoding table: byte value i is decoded to the character at index i.
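
To make the lookup concrete, here's a minimal pure-Python sketch of the proposed semantics (not the C code from the patch; the function name and error handling are illustrative only):

def charmap_decode_sketch(data, table):
    # Decode the byte string `data` using the unicode string `table`.
    # Byte value i maps to table[i]; u"\ufffd" (or an index past the
    # end of the table) means "maps to undefined" and is an error here.
    result = []
    for pos in range(len(data)):
        key = ord(data[pos])
        if key >= len(table) or table[key] == u"\ufffd":
            raise UnicodeDecodeError("charmap", data, pos, pos + 1,
                                     "character maps to <undefined>")
        result.append(table[key])
    return u"".join(result)

# A 256-character identity table behaves like latin-1:
table = u"".join([unichr(i) for i in range(256)])
print repr(charmap_decode_sketch("abc", table))  # u'abc'

The real decoder does this loop in C, which is where the speedup shown below comes from.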
Speed looks like this:
python2.4 -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
1000 loops, best of 3: 538 usec per loop
python2.4 -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
100 loops, best of 3: 3.85 msec per loop
./python-cvs -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
1000 loops, best of 3: 539 usec per loop
./python-cvs -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
1000 loops, best of 3: 623 usec per loop
Creating the decoding map as a string should probably be done by gencodec.py directly; that way the first import of the codec would be faster too.
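
For illustration, such a table could be built from the existing decoding_map dict roughly like this (the helper name is made up, not actual gencodec.py code):

def build_decoding_table(decoding_map):
    # decoding_map: {byte value: unicode codepoint or None}, as the
    # generated codec modules define it today.
    chars = []
    for byte in range(256):
        codepoint = decoding_map.get(byte)
        if codepoint is None:
            codepoint = 0xFFFD  # "maps to undefined", per the proposal
        chars.append(unichr(codepoint))
    return u"".join(chars)

gencodec.py could then write the resulting string literal into the generated codec module, so no table has to be built at import time.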
Bye, Walter Dörwald