[Python-Dev] Unicode charmap decoders slow (original) (raw)
Walter Dörwald walter at livinglogic.de
Wed Oct 5 10:21:06 CEST 2005
- Previous message: [Python-Dev] Unicode charmap decoders slow
- Next message: [Python-Dev] Unicode charmap decoders slow
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Am 05.10.2005 um 00:08 schrieb Martin v. Löwis:
Walter Dörwald wrote:
This array would have to be sparse, of course.
For encoding yes, for decoding no. [...] For decoding it should be sufficient to use a unicode string of length 256. u"\ufffd" could be used for "maps to undefined". Or the string might be shorter and byte values greater than the length of the string are treated as "maps to undefined" too. Right. That's what I meant with "sparse": you somehow need to represent "no value".
OK, but I don't think that we really need a sparse data structure for
that. I used the following script to check that:
import sys, os.path, glob, encodings
has = 0 hasnt = 0
for enc in glob.glob("%s/*.py" % os.path.dirname(encodings.file)): enc = enc.rsplit(".")[-2].rsplit("/")[-1] try: import("encodings.%s" % enc) codec = sys.modules["encodings.%s" % enc] except: pass else: if hasattr(codec, "decoding_map"): print codec for i in xrange(0, 256): if codec.decoding_map.get(i, None) is not None: has += 1 else: hasnt += 1 print "assigned values:", has, "unassigned values:", hasnt
It reports that in all the charmap codecs there are 15292 assigned
byte values and only 324 unassigned ones. I.e. only about 2% of the
byte values map to "undefined". Storing those codepoints in the array
as U+FFFD would only need 648 (or 1296 for wide builds) additional
bytes. I don 't think a sparse data structure could beat that.
This might work, although nobody has complained about charmap encoding yet. Another option would be to generate a big switch statement in C and let the compiler decide about the best data structure. I would try to avoid generating C code at all costs. Maintaining the build processes will just be a nightmare.
Sounds resonable.
Bye, Walter Dörwald
- Previous message: [Python-Dev] Unicode charmap decoders slow
- Next message: [Python-Dev] Unicode charmap decoders slow
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]