[Python-Dev] Ill-defined encoding for CP875?
Tim Peters tim_one@email.msn.com
Sat, 12 May 2001 07:28:27 -0400
I have a way to make dict lookup a teensy bit cheaper(*) that significantly reduces the number of collisions (which is much more valuable).
This caused a number of std tests to fail, because they were implicitly relying on the order in which a dict's entries are materialized via .keys() or .items().
Most of these were easy enough to fix. The last failure remaining is test_unicode, and I don't know how to fix it. It's dying here:
try:
    verify(unicode(s,encoding).encode(encoding) == s)
except TestFailed:
    print '*** codec "%s" failed round-trip' % encoding
except ValueError,why:
    print '*** codec for "%s" failed: %s' % (encoding, why)
when encoding == "cp875". There's a bogus problem you have to worm around first: test_unicode neglected to import TestFailed, so it actually dies with NameError while trying the "except TestFailed" clause after verify() raises TestFailed. Once that's repaired, it's complaining about failing the round-trip encoding.
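The NameError is easy to reproduce in isolation. Here is a minimal sketch in Python 3 syntax (the test code quoted above is Python 2). The names `MissingError`, `guarded` and `outcome` are made up for illustration and are not from test_unicode. The point is that the expression in an except clause is evaluated only when an exception actually reaches it, so a name that was never imported stays hidden until the try body fails:

```python
def guarded(fail):
    """Run a try block whose except clause names something never imported."""
    try:
        if fail:
            raise RuntimeError("round-trip check failed")
        return "ok"
    except MissingError:  # hypothetical name, deliberately left undefined
        return "handled"


def outcome(fail):
    """Report what comes out of guarded(): its result or the exception type."""
    try:
        return guarded(fail)
    except Exception as exc:
        return type(exc).__name__


print(outcome(False))  # ok: the except clause is never evaluated
print(outcome(True))   # NameError: it replaces the original RuntimeError
```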
The original character in s it's griping about is "?" (0x3f). cp875.py has this entry in its decoding_map dict:
0x003f: 0x001a, # SUBSTITUTE
But 0x1a is not a unique value in this dict. There's also
0x00dc: 0x001a, # SUBSTITUTE
0x00e1: 0x001a, # SUBSTITUTE
0x00ec: 0x001a, # SUBSTITUTE
0x00ed: 0x001a, # SUBSTITUTE
0x00fc: 0x001a, # SUBSTITUTE
0x00fd: 0x001a, # SUBSTITUTE
Therefore what appears associated with 0x1a in the derived encoding_map dict:
encoding_map = {}
for k,v in decoding_map.items():
    encoding_map[v] = k
may end up being any of the 7 decoding_map keys that map to 0x1a. It just so happened to map back to 0x3f before, but to 0xfd after the dict change, so "?" doesn't survive the round trip anymore.
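To make the order dependence concrete, here is a small Python 3 sketch using only the seven entries quoted above. The helper names `invert` and `ambiguous_values` are invented for this illustration and are not part of cp875.py. It lists the values reached from more than one key, then inverts the same pairs in two different orders:

```python
# The seven cp875 byte values quoted above, all decoding to 0x1a (SUBSTITUTE).
SUBSTITUTE_BYTES = [0x3f, 0xdc, 0xe1, 0xec, 0xed, 0xfc, 0xfd]


def invert(pairs):
    """Naive inversion: the last pair seen for a given value wins."""
    encoding_map = {}
    for k, v in pairs:
        encoding_map[v] = k
    return encoding_map


def ambiguous_values(decoding_map):
    """Map each value reached from more than one key to its sorted keys."""
    keys_by_value = {}
    for k, v in decoding_map.items():
        keys_by_value.setdefault(v, []).append(k)
    return {v: sorted(ks) for v, ks in keys_by_value.items() if len(ks) > 1}


decoding_map = {k: 0x1a for k in SUBSTITUTE_BYTES}

# Two visiting orders for the same pairs: as listed, and with 0x3f moved last.
as_listed = [(k, 0x1a) for k in SUBSTITUTE_BYTES]
rotated = as_listed[1:] + as_listed[:1]

print(ambiguous_values(decoding_map))  # one entry: 0x1a with all seven keys
print(hex(invert(as_listed)[0x1a]))    # 0xfd
print(hex(invert(rotated)[0x1a]))      # 0x3f
```

Same data, different winner: whichever key the loop happens to visit last is the one that survives in encoding_map.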
My knowledge of encoding internals is exceeded only by my mastery of file URLs under Windows, so I could sure use some help getting this repaired. I'd really like to check in the dict improvement (+ test repairs), but won't do it so long as it makes a std test fail. If, e.g., you're relying on "the first" of a set of ambiguous reverse mappings winning the game, then iterating over decoding_map.items() in reverse sorted order would do the trick reliably. But I don't know whether the ambiguity in cp875 is a bug or an undocumented feature ...
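For concreteness, the reverse-sorted idea could look like the following Python 3 sketch. It assumes "the first" (smallest) key is the intended winner, which is exactly the open question here. `build_encoding_map` and `round_trips` are hypothetical helpers, not existing codec functions:

```python
def build_encoding_map(decoding_map):
    """Invert decoding_map so the smallest key wins for each duplicated value."""
    encoding_map = {}
    # Reverse sorted order visits the largest keys first, so the smallest key
    # for any value is assigned last and overwrites the others.
    for k, v in sorted(decoding_map.items(), reverse=True):
        encoding_map[v] = k
    return encoding_map


# The seven ambiguous entries from cp875.py, inserted in a scrambled order to
# show that the result no longer depends on how the dict was populated.
decoding_map = {k: 0x1a for k in (0xfd, 0xec, 0x3f, 0xdc, 0xed, 0xe1, 0xfc)}
encoding_map = build_encoding_map(decoding_map)


def round_trips(byte):
    """True if decoding and then encoding returns the original byte value."""
    return encoding_map[decoding_map[byte]] == byte


print(hex(encoding_map[0x1a]))  # 0x3f
print(round_trips(0x3f))        # True
print(round_trips(0xfd))        # False
```

Only one of the seven byte values can ever round-trip. Sorting just makes the choice of survivor repeatable; it does not say which one the codec ought to pick.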
7-bit-ascii-looks-better-every-day-ly y'rs - tim
(*) Simply by taking the damn "~" off "~hash" -- I explained quite a while
ago why that can lead to a weak form of clustering "in theory", and
instrumenting the dict lookup code confirmed that it does hurt in real life.