[Python-Dev] Unicode charmap decoders slow
Tony Nelson tonynelson at georgeanelson.com
Tue Oct 4 03:11:29 CEST 2005
- Previous message: [Python-Dev] PEP 343 and __with__
- Next message: [Python-Dev] Unicode charmap decoders slow
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Is there a faster way to transcode from 8-bit chars (charmaps) to utf-8 than going through unicode()?
I'm writing a small card-file program. As a test, I use a 53 MB MBox file, in mac-roman encoding. My program reads and parses the file into messages in about 3 to 5 seconds (Wow! Go Python!), but takes about 14 seconds to iterate over the cards and convert them to utf-8:
    for i in xrange(len(cards)):
        u = unicode(cards[i], encoding)
        cards[i] = u.encode('utf-8')
The time is nearly all in the unicode() call. It's not so much how much time it takes, but that it takes 4 times as long as the real work, just to do table lookups.
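One way to check where the time goes is to time the decode and encode steps separately. The following is only a sketch in modern Python 3 (where `bytes.decode()` takes the place of the `unicode()` call in this post); the function name `time_transcode` and the sample data are hypothetical.

```python
import time


def time_transcode(cards, encoding):
    """Transcode each card to UTF-8, timing decode and encode separately."""
    decode_seconds = 0.0
    encode_seconds = 0.0
    result = []
    for card in cards:
        start = time.perf_counter()
        text = card.decode(encoding)          # charmap decode (the slow step in the post)
        middle = time.perf_counter()
        result.append(text.encode("utf-8"))   # re-encode as UTF-8
        end = time.perf_counter()
        decode_seconds += middle - start
        encode_seconds += end - middle
    return result, decode_seconds, encode_seconds


# Hypothetical sample data: 0x8E is "e with acute accent" in mac-roman.
sample_cards = [b"caf\x8e", b"plain ascii"]
converted, decode_s, encode_s = time_transcode(sample_cards, "mac_roman")
print(converted, decode_s, encode_s)
```

On inputs this small the timings are dominated by timer overhead; the split only becomes meaningful on a large file such as the one described above.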
Looking at the source (which, if I have it right, is PyUnicode_DecodeCharmap() in unicodeobject.c), I think it is doing a dictionary lookup for each character. I would have thought that it would make and cache a LUT the size of the charmap (and hook the relevant dictionary stuff to delete the cached LUT if the dictionary is changed). (You may consider this a request for enhancement. ;)
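The lookup-table idea can be illustrated in pure Python. This is a sketch only, written for modern Python 3, and it says nothing about how the C decoder is or should be implemented; the names `build_decoding_table` and `decode_with_table` are hypothetical, and mapping undecodable bytes to U+FFFD is an assumption made for the example.

```python
def build_decoding_table(encoding):
    """Build a 256-entry list mapping each byte value to its decoded character."""
    table = []
    for byte in range(256):
        try:
            table.append(bytes([byte]).decode(encoding))
        except UnicodeDecodeError:
            # Assumption: undefined bytes become the Unicode replacement character.
            table.append("\ufffd")
    return table


def decode_with_table(data, table):
    """Decode 8-bit data by indexing the table once per byte."""
    return "".join(table[b] for b in data)


# Build the table once, then reuse it for every card.
mac_roman_table = build_decoding_table("mac_roman")
all_bytes = bytes(range(256))
decoded = decode_with_table(all_bytes, mac_roman_table)
```

A list indexed by byte value replaces the per-character dictionary lookup with a plain sequence index. In pure Python this is unlikely to beat the built-in decoder; the point is only to show the shape of the table.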
I thought of using U"".translate(), but the unicode version is defined to be slow, and anyway I can't find any way to just shove my 8-bit data into a unicode string without translation. Is there some similar approach? I'm almost (but not quite) ready to try it in Pyrex.
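For the question of getting 8-bit data into a unicode string without translation: decoding as Latin-1 maps each byte value 0-255 to the code point with the same number, so it can serve as that step, after which `translate()` applies the charmap. The sketch below uses modern Python 3 (`bytes.decode` and `str.translate`), not the Python 2 of this post, and makes no claim about speed; the names `build_translate_map` and `decode_via_translate` are hypothetical.

```python
def build_translate_map(encoding):
    """Map each code point 0-255 to the character that byte has in `encoding`."""
    mapping = {}
    for byte in range(256):
        try:
            mapping[byte] = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            # Assumption: undefined bytes become the Unicode replacement character.
            mapping[byte] = "\ufffd"
    return mapping


def decode_via_translate(data, mapping):
    """Load bytes unchanged via Latin-1, then remap them with str.translate()."""
    return data.decode("latin-1").translate(mapping)


mac_roman_map = build_translate_map("mac_roman")
raw = bytes(range(256))
text = decode_via_translate(raw, mac_roman_map)
utf8 = text.encode("utf-8")
```

Whether this is faster than calling the codec directly would need to be measured on real data.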
I'm new to Python. I didn't find anything relevant by googling python.org or the groups. I posted this in comp.lang.python yesterday and got a couple of responses, but I think this may be too sophisticated a question for that group.
I'm not a member of this list, so please copy me on replies so I don't have to hunt them down in the archive.
TonyN.:' <mailto:tonynelson at georgeanelson.com> ' <http://www.georgeanelson.com/>