[Python-Dev] RE: Ill-defined encoding for CP875?

M.-A. Lemburg mal@lemburg.com
Tue, 15 May 2001 10:32:14 +0200


Tim Peters wrote:

[M.-A. Lemburg]
> The problem is: which part would raise the exception -- the
> encoder or the decoder ?

Since I don't yet use any of this stuff for real, I have no idea: seems mostly a question of pragmatics, and I don't have any feel for how cp875 users would view it.

If there are any... that code page dates back to 1996 and is based in the EBCDIC world.

> Here are some more options:
>
> * sort the items before creating the encoding table from the
>   decoding one (makes the mapping stable)

If users don't care that round-trip can fail silently, fine.

> * map keys which have multiple mappings in the encoding table
>   to None -- this causes their usage to raise an exception
>   (undefined mapping)

If users don't care that they'll get an exception when they try something that can't be round-tripped, fine. Or would this depend on the value of the "errors" argument too? Then it's easier to impose.
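A minimal sketch of the two options quoted above, written for current Python and using a made-up toy decoding table (the byte and code point values are illustrative assumptions, not the real cp875 data). The helper names `invert_sorted` and `invert_strict` are hypothetical.

```python
# Toy decoding table (byte -> code point). NOT the real cp875 mapping:
# two bytes decode to the same code point, so the inverse is ambiguous.
toy_decoding_map = {
    0x41: 0x0391,
    0x42: 0x0392,
    0x3F: 0x001A,
    0xE1: 0x001A,
}


def invert_sorted(decoding_map):
    """Option 1: walk the items in sorted order so the result is stable.

    For an ambiguous code point the largest byte value wins; the other
    bytes silently fail to round-trip.
    """
    encoding_map = {}
    for byte, codepoint in sorted(decoding_map.items()):
        encoding_map[codepoint] = byte
    return encoding_map


def invert_strict(decoding_map):
    """Option 2: map ambiguous code points to None (undefined mapping)."""
    encoding_map = {}
    for byte, codepoint in decoding_map.items():
        if codepoint in encoding_map:
            # Seen before: more than one byte decodes to this code point.
            encoding_map[codepoint] = None
        else:
            encoding_map[codepoint] = byte
    return encoding_map


stable = invert_sorted(toy_decoding_map)
strict = invert_strict(toy_decoding_map)
```

With the first helper the ambiguous code point always encodes to the same byte; with the second it is marked undefined, so a charmap-based encoder would treat it as an error under 'strict'.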

The errors argument tells the codecs what to do in case a mapping fails (from codecs.py):

    The .encode()/.decode() methods may implement different error
    handling schemes by providing the errors argument. These
    string values are defined:

     'strict' - raise a ValueError error (or a subclass)
     'ignore' - ignore the character and continue with the next
     'replace' - replace with a suitable replacement character;
                Python will use the official U+FFFD REPLACEMENT
                CHARACTER for the builtin Unicode codecs.

'strict' is the default for all operations that deal with auto-conversion. 'ignore' and 'replace' allow silently ignoring the problem.
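To illustrate the three error handling schemes, here is a small sketch using the built-in ASCII codec in current Python (an assumption; the thread itself predates this syntax). Note that on encoding, 'replace' falls back to "?" because U+FFFD cannot be represented in the target encoding, while on decoding it inserts U+FFFD.

```python
text = "a\u20acb"  # contains EURO SIGN, which ASCII cannot encode

# 'strict': the failed mapping raises UnicodeEncodeError,
# which is a subclass of ValueError.
try:
    text.encode("ascii", "strict")
    strict_error = None
except ValueError as exc:
    strict_error = exc

# 'ignore': the offending character is dropped.
ignored = text.encode("ascii", "ignore")

# 'replace' when encoding: a "?" is substituted.
replaced = text.encode("ascii", "replace")

# 'replace' when decoding: U+FFFD REPLACEMENT CHARACTER is substituted.
decoded = b"a\xffb".decode("ascii", "replace")
```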

There's a theme here: I have no idea how important roundtrip is in Unicode Practice, or even that it's a constant across apps and encodings. If I write a codec to map all ASCII consonants to u"k" and vowels to u"a", I wouldn't care that I can't get "love" back from u"kaka".

Round-tripping is obviously very important if you use Unicode as the basis for working on text. I don't know the reasoning behind making cp875 fail the round-trip -- Unicode certainly provides means to make mappings round-trip safe (e.g. by resorting to code points in the private use areas).
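As a rough sketch of the private-use idea, the following reassigns every duplicate target in a toy decoding table to a fresh code point starting at U+E000, so that the table becomes invertible. This is only an illustration under stated assumptions: the table is made up (not the real cp875 data), the helper name `make_roundtrip_safe` is hypothetical, and it assumes the table does not already use code points from U+E000 upwards.

```python
PUA_START = 0xE000  # first code point of the BMP Private Use Area

# Toy decoding table (byte -> code point); NOT the real cp875 mapping.
toy_decoding_map = {
    0x3F: 0x001A,
    0x41: 0x0391,
    0xE1: 0x001A,  # duplicate target: breaks the round-trip
}


def make_roundtrip_safe(decoding_map):
    """Return a copy in which every byte decodes to a unique code point.

    The first byte (in sorted order) keeps the original code point; later
    bytes with the same target are moved to private-use code points.
    """
    seen = set()
    safe = {}
    next_pua = PUA_START
    for byte, codepoint in sorted(decoding_map.items()):
        if codepoint in seen:
            codepoint = next_pua
            next_pua += 1
        seen.add(codepoint)
        safe[byte] = codepoint
    return safe


safe_decoding_map = make_roundtrip_safe(toy_decoding_map)

# Every code point now has exactly one source byte, so inverting is safe.
safe_encoding_map = {cp: byte for byte, cp in safe_decoding_map.items()}
roundtrip_ok = all(
    safe_encoding_map[cp] == byte for byte, cp in safe_decoding_map.items()
)
```

The trade-off is that the private-use code points carry no agreed meaning outside the application that assigned them.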

-- Marc-Andre Lemburg


Company & Consulting: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/