[Python-Dev] Re: Ill-defined encoding for CP875?

M.-A. Lemburg mal@lemburg.com
Sun, 13 May 2001 19:20:01 +0200


Tim Peters wrote:

> I have a way to make dict lookup a teensy bit cheaper(*) that
> significantly reduces the number of collisions (which is much more
> valuable).  This caused a number of std tests to fail, because they
> were implicitly relying on the order in which a dict's entries are
> materialized via .keys() or .items().  Most of these were easy enough
> to fix.  The last failure remaining is test_unicode, and I don't know
> how to fix it.  It's dying here:
>
>     try:
>         verify(unicode(s,encoding).encode(encoding) == s)
>     except TestFailed:
>         print '*** codec "%s" failed round-trip' % encoding
>     except ValueError,why:
>         print '*** codec for "%s" failed: %s' % (encoding, why)
>
> when encoding == "cp875".  There's a bogus problem you have to worm
> around first:  test_unicode neglected to import TestFailed, so it
> actually dies with NameError while trying the "except TestFailed"
> clause after verify() raises TestFailed.  Once that's repaired, it's
> complaining about failing the round-trip encoding.

Ooops; this must have been caused by the assert statement removal in the test suite I hacked up some months ago. Funny that it never showed up... the code seems to be very robust ;-)
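The NameError part is easy enough to cure with an import at the top of the test; a minimal sketch, assuming verify() and TestFailed both live in test_support and using a throw-away s (the real test builds its strings differently):

    from test_support import verify, TestFailed

    s = ''.join(map(chr, range(128)))
    encoding = 'cp875'
    try:
        # verify() raises TestFailed when the condition is false
        verify(unicode(s, encoding).encode(encoding) == s)
    except TestFailed:
        print '*** codec "%s" failed round-trip' % encoding
    except ValueError, why:
        print '*** codec for "%s" failed: %s' % (encoding, why)

With the import in place, a failed round-trip gets reported instead of blowing up inside the except clause.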

> The original character in s it's griping about is "?" (0x3f).  cp875.py
> has this entry in its decoding_map dict:

>     0x003f: 0x001a,  # SUBSTITUTE
>
> But 0x1a is not a unique value in this dict.  There's also
>
>     0x00dc: 0x001a,  # SUBSTITUTE
>     0x00e1: 0x001a,  # SUBSTITUTE
>     0x00ec: 0x001a,  # SUBSTITUTE
>     0x00ed: 0x001a,  # SUBSTITUTE
>     0x00fc: 0x001a,  # SUBSTITUTE
>     0x00fd: 0x001a,  # SUBSTITUTE
>
> Therefore what appears associated with 0x1a in the derived encoding_map
> dict:
>
>     encoding_map = {}
>     for k,v in decoding_map.items():
>         encoding_map[v] = k
>
> may end up being any of the 7 decoding_map keys that map to 0x1a.  It
> just so happened to map back to 0x3f before, but to 0xfd after the dict
> change, so "?" doesn't survive the round trip anymore.
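To see why the winner is arbitrary, here is a cut-down sketch of the same construction (only the SUBSTITUTE entries are copied from cp875.py; everything else is illustration):

    # A few entries from cp875.py's decoding_map: several byte values
    # all decode to U+001A (SUBSTITUTE).
    decoding_map = {
        0x003f: 0x001a,  # SUBSTITUTE
        0x00dc: 0x001a,  # SUBSTITUTE
        0x00fd: 0x001a,  # SUBSTITUTE
    }

    # Reverse the map the same way the generated codecs do.  Whichever
    # key the dict happens to yield last for the value 0x1a "wins".
    encoding_map = {}
    for k, v in decoding_map.items():
        encoding_map[v] = k

    # encoding_map[0x1a] is now 0x3f, 0xdc or 0xfd, depending purely on
    # dict iteration order -- which is exactly what broke the
    # "?" -> U+001A -> "?" round trip.
    print(hex(encoding_map[0x001a]))

Since all of those keys are equally valid as far as the mapping is concerned, any change to the dict internals can silently pick a different one.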

The "right" thing to do here, is to simply remove cp875 from the test for round-tripping. It is not the only encoding which fails this test, but it's not our fault: the codecs were all generated from the original codec maps at the Unicode.org site.

If their mappings are broken, we can't do much about it... other than to ignore the error or remove the codec altogether.
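If we go that route, the round-trip loop only needs a small skip list; a rough sketch in the style of the test above (the encodings listed and the skip set are only examples):

    # Codecs whose mapping tables are not one-to-one and therefore
    # cannot round-trip every byte; cp875 is the one under discussion.
    broken_round_trip = ('cp875',)

    s = ''.join(map(chr, range(128)))
    for encoding in ('ascii', 'latin-1', 'cp437', 'cp875'):
        if encoding in broken_round_trip:
            continue
        if unicode(s, encoding).encode(encoding) != s:
            print '*** codec "%s" failed round-trip' % encoding

That keeps the test meaningful for the codecs we can actually guarantee, without touching the generated tables.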

-- Marc-Andre Lemburg


Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/