[Python-Dev] Ill-defined encoding for CP875?

Tim Peters tim.one@home.com
Sat, 12 May 2001 17:22:49 -0400


[/F]

reverse sorting makes sense to me. but the cp-files appear to be machine generated, so patching that python file won't help.

Agreed.

a truly future-proof solution would be to specify exactly how to resolve every many-to-one mapping, for every font having that problem. but sorting them is clearly better than relying on implementation-dependent behaviour...
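One way to read the sorting suggestion above, as a minimal sketch in present-day Python 3 rather than the Python of this thread: invert the decoding map in a fixed order, so the winner of a many-to-one mapping no longer depends on dictionary iteration order. The function name and the toy map are made up for illustration; only the three byte values mapping to 0x1a are taken from the cp875 complaint in this message.

```python
def invert_decoding_map(decoding_map):
    """Build an encoding map, resolving many-to-one entries deterministically."""
    encoding_map = {}
    # Visit byte values from highest to lowest, so the lowest byte value is
    # written last and wins when several bytes decode to the same code point.
    for byte in sorted(decoding_map, reverse=True):
        encoding_map[decoding_map[byte]] = byte
    return encoding_map


# Toy map: 0x3f, 0xdc and 0xe1 all decode to 0x1a (as reported for cp875);
# the 0x41 entry is a hypothetical unambiguous mapping added for contrast.
toy = {0x3f: 0x1a, 0xdc: 0x1a, 0xe1: 0x1a, 0x41: 0x41}
inverse = invert_decoding_map(toy)
print({hex(k): hex(v) for k, v in inverse.items()})
```

Whether the lowest or the highest byte should win is a policy choice; the point of the sketch is only that the choice is made explicitly instead of by accident.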

The attached program suggests the problem is rare; of those encoding files that have a Python decoding_map dict, only these triggered a meaningful ambiguity complaint:

*** cp1006.py maps 0xfe8e back to 0xb1, 0xb2
*** cp875.py maps 0x1a back to 0x3f, 0xdc, 0xe1, 0xec, 0xed, 0xfc, 0xfd

Then since test_unicode only checks for roundtrip across range(0x80), cp875 is the only one that can fail (the ambiguities in cp1006 are for points > 0x7f, so aren't tested here).
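A roundtrip check of the kind described here can be sketched as follows. This is an illustration in present-day Python 3, not the actual test_unicode code; the function name and its `limit` parameter are assumptions made for the example.

```python
def roundtrip_failures(encoding, limit=0x80):
    """Return the code points below `limit` that do not survive
    an encode/decode roundtrip through `encoding`."""
    failures = []
    for cp in range(limit):
        ch = chr(cp)
        try:
            ok = ch.encode(encoding).decode(encoding) == ch
        except UnicodeError:
            # A character the codec cannot encode or decode counts as a failure.
            ok = False
        if not ok:
            failures.append(cp)
    return failures


# Codecs that are ASCII-compatible should report no failures below 0x80.
for name in ("ascii", "latin-1", "utf-8"):
    print(name, [hex(cp) for cp in roundtrip_failures(name)])
```

Because the loop stops at `limit`, an ambiguity that only involves code points above 0x7f, such as the cp1006 one, is never exercised by a check like this.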

Hmm! Now I see that in a part of test_unicode that wasn't reached, cp875 and cp1006 are excluded, with this comment:

### These fail the round-trip:
#'cp1006', 'cp875', 'iso8859_8',

So the practical hack for now is to exclude cp875 from the earlier range(128) roundtrip test too.

(is Jython using exactly the same hashing and dictionary algorithms as CPython? or does it work by accident also under Jython?)

Sorry, no idea. Attempting to browse the Jython source on SourceForge caused this cute behavior:

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/jython/jython/Lib/

Python Exception Occurred

Traceback (innermost last):
  File "/usr/lib/cgi-bin/viewcvs.cgi", line 2286, in ?
    main()
  File "/usr/lib/cgi-bin/viewcvs.cgi", line 2253, in main
    view_directory(request)
  File "/usr/lib/cgi-bin/viewcvs.cgi", line 1043, in view_directory
    fileinfo, alltags = get_logs(full_name, rcs_files, view_tag)
  File "/usr/lib/cgi-bin/viewcvs.cgi", line 987, in get_logs
    raise 'error during rlog: '+hex(status)
error during rlog: 0x100

let's-rewrite-it-in-php-ly y'rs - tim

ENCODING_DIR = "../Lib/encodings"

import os
import imp

def d(w):
    # Show ints in hex, anything else via repr.
    if type(w) is type(6):
        return hex(w)
    else:
        return repr(w)

encfiles = [name for name in os.listdir(ENCODING_DIR)
            if name.endswith(".py") and name[0] != "_"]

for fname in encfiles:
    path = os.path.join(ENCODING_DIR, fname)
    f = open(path)
    module = imp.load_source(fname[:-3], path, f)
    f.close()
    decode = getattr(module, "decoding_map", None)
    if decode is None:
        print fname, "doesn't have decoding_map."
        continue
    # Invert the map: each decoded value -> list of keys that decode to it.
    vtok = {}
    for k, v in decode.items():
        if v in vtok:
            vtok[v].append(k)
        else:
            vtok[v] = [k]
    # A value reachable from more than one key has an ambiguous reverse map.
    ambiguous = [(v, ks) for v, ks in vtok.items() if len(ks) > 1]
    if ambiguous:
        for v, ks in ambiguous:
            ks.sort()
            print "***", fname, "maps", d(v), "back to", \
                  ", ".join(map(d, ks))
    else:
        print fname, "is free of ambiguous reverse maps."