[Python-Dev] RE: Ill-defined encoding for CP875?

Tim Peters tim.one@home.com
Sat, 12 May 2001 17:48:38 -0400


[Martin v. Loewis, whose encyclopedic knowledge of encoding details still isn't enough to get a clear answer (it's like somebody asking me for a simple answer to a floating point question)]

... So I think we can take one of two approaches:

1. admit that CP 875 is not round-trippable, and exclude it from the test (although when looking at the first 128 characters only, it is round-trippable).

As I noted later, 875 is already excluded from the roundtrip test across range(128, 256). What it's failing is the roundtrip test across range(128): after unicode("?", "cp875") produces u'\x1a', the following .encode('cp875') has no way to know which range the original input came from. So it's not really round-trippable across range(128) either unless more info is given to .encode().
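For anyone who wants to poke at this themselves, here is a rough sketch of the kind of check being discussed. It is not the actual test-suite code: the helper names roundtrip_failures and colliding_bytes are made up for illustration, and it is written for Python 3, where bytes.decode() takes the place of the unicode() call above.

```python
def roundtrip_failures(codec, byte_values):
    """Return the byte values that do not survive decode -> encode unchanged."""
    failures = []
    for value in byte_values:
        raw = bytes([value])
        try:
            if raw.decode(codec).encode(codec) != raw:
                failures.append(value)
        except UnicodeError:
            # Undefined byte or unencodable character: not round-trippable either.
            failures.append(value)
    return failures


def colliding_bytes(codec, byte_values=range(256)):
    """Map each decoded character to its source bytes, keeping only
    characters that two or more distinct bytes decode to."""
    by_char = {}
    for value in byte_values:
        try:
            char = bytes([value]).decode(codec)
        except UnicodeError:
            continue  # byte has no mapping in this codec
        by_char.setdefault(char, []).append(value)
    return {char: vals for char, vals in by_char.items() if len(vals) > 1}


if __name__ == "__main__":
    # What this prints depends on the cp875 table shipped with your Python.
    print("range(128) failures:", [hex(v) for v in roundtrip_failures("cp875", range(128))])
    for char, vals in colliding_bytes("cp875").items():
        print(repr(char), "<-", [hex(v) for v in vals])
```

Any character that colliding_bytes reports is one that .encode() cannot map back unambiguously, which is exactly the situation described for u'\x1a'. The results for cp875 depend on the mapping table in the Python you run it on, so no expected output is given here.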

2. remove the SUBSTITUTE mappings from CP875, acknowledging that apparently these characters have no meaning in that code page. Unfortunately, I could not find any official IBM documentation page that lists the characters supported in each of the EBCDIC code pages.

The second seems to be more correct to me, although it is a deviation from the Unicode consortium publications.

Until you and MAL agree on the best thing to do (I have no opinion: my only exposure to Unicode in daily programming life remains the Python test suite), I'm going to opt for #1: as cp875.py stands today, it's simply a fact that it's not round-trippable across any range including 0x3f.