[Python-3000] Unicode and OS strings

Stephen J. Turnbull stephen at xemacs.org
Sat Sep 15 05:44:05 CEST 2007


Greg Ewing writes:

Stephen J. Turnbull wrote:

You chose the context of round-tripping across encodings, not me. Please stick with your context.

Maybe we have different ideas of what the problem is. I thought the problem is to take arbitrary byte sequences coming in as command-line args and represent them as unicode strings in such a way that they can be losslessly converted back into the same byte strings.

That's a straw man if taken literally. Just use the ISO-8859-1 codec, and you're done.
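To make that concrete, here is a minimal sketch (py3k-style str/bytes, with made-up example bytes) of why the ISO-8859-1 codec trivially satisfies the literal round-trip requirement:

    # ISO-8859-1 maps every byte 0x00-0xFF to a distinct code point, so any
    # byte sequence round-trips losslessly, meaningful text or not.
    raw = b"\xff\xfe caf\xe9 \x80\x81"     # arbitrary, possibly non-text bytes
    s = raw.decode("iso-8859-1")           # never raises: one character per byte
    assert s.encode("iso-8859-1") == raw   # the exact original bytes come back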

If you add the condition that the encoding is known with certainty and the source string is well-formed for that encoding, then you need to decode to meaningful Unicode. For that problem, James Knight's solution is good if it makes sense to assume that the sequence of bytes is encoded in UTF-8 Unicode. However, I don't think that is a reasonable assumption for a language that is heavily used in Europe and Japan, and for processing email. These are contexts where UTF-8 is making steady progress, but legacy encodings are still quite important.
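To show why the UTF-8 assumption breaks down for those contexts (this is only an illustration of the failure mode, not a rendering of James's proposal):

    # Bytes that are perfectly valid Shift JIS are typically not valid UTF-8.
    jp = "\u65e5\u672c\u8a9e".encode("shift_jis")   # "Japanese", in Japanese
    try:
        jp.decode("utf-8")
    except UnicodeDecodeError:
        print("not UTF-8")          # the UTF-8 assumption fails on this data
    print(jp.decode("shift_jis"))   # the right codec recovers the intended text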

However, the general problem is to decode a sequence of bytes into a Unicode string and be able to recover the original sequence if you decide you got it wrong, even after you've sliced and concatenated the string with other strings, with no guarantee that all the source encodings were the same.

I was just pointing out that if you do this in a way that involves some sort of dynamically generated mapping, then it won't work if the round trip spans more than one Python session -- and that there are any number of ways that the data could get from one session to another, many of them not involving anything that one would recognise as a unicode encoding in the conventional sense.

But it also won't work if you just pass around strings that are invertible to byte sequences, because recipients don't know which byte sequence to invert them to. Is that cruft corrupt EUC-JP or corrupt Shift JIS or corrupt UTF-8? Or maybe it's a perfectly valid character, even one with a Unicode code point, that simply isn't in the table for the source encoding (this happens with Japanese all the time)? You're likely to make different guesses about what a specific sequence of byte cruft was intended to be, depending on which encoding you assume it started out in.
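A sketch of that ambiguity (the sample text and the list of candidate encodings are only for illustration):

    cruft = "\u4e00\u756a".encode("shift_jis")   # what the sender actually meant
    for guess in ("utf-8", "euc-jp", "shift_jis", "iso-8859-1"):
        try:
            print(guess, "->", ascii(cruft.decode(guess)))
        except UnicodeDecodeError:
            print(guess, "-> not even decodable")
    # Only the correct guess recovers the intended text; ISO-8859-1 "succeeds"
    # but produces mojibake, and the wrong Japanese codec simply fails here.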

What I'm suggesting is to provide a way for processes to record and communicate that information without needing to provide a "source encoding" slot for strings, and which is able to handle strings containing unrecognized (including corrupt) characters from multiple source encodings. True, it will be up to the applications to communicate that information, but that is the case anyway.

Furthermore, the same algorithms can be used to "fold" any text that contains only BMP characters plus no more than 6400 distinct non-BMP characters (6400 being the size of the BMP private use area) into the BMP, which would be a nice feature for people wanting to avoid the UTF-16 surrogates for some reason.
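For concreteness, a minimal sketch of such a folding, assuming the stand-ins come from the BMP private use area starting at U+E000 (the function names are mine, and the sketch ignores text that already contains private-use characters):

    PUA_START, PUA_SIZE = 0xE000, 6400

    def fold_to_bmp(text):
        table = {}                        # non-BMP character -> PUA stand-in
        out = []
        for ch in text:
            if ord(ch) <= 0xFFFF:
                out.append(ch)
            else:
                if ch not in table:
                    if len(table) >= PUA_SIZE:
                        raise ValueError("more than 6400 distinct non-BMP characters")
                    table[ch] = chr(PUA_START + len(table))
                out.append(table[ch])
        return "".join(out), table        # the table is what must be communicated

    def unfold_from_bmp(folded, table):
        reverse = {v: k for k, v in table.items()}
        return "".join(reverse.get(ch, ch) for ch in folded)

    text = "BMP text plus \U0001d49e and \U0001f40d"
    folded, table = fold_to_bmp(text)
    assert all(ord(ch) <= 0xFFFF for ch in folded)
    assert unfold_from_bmp(folded, table) == text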

As Martin points out, it may not be possible to implement this without changing the codecs one by one (I have some hope that it can nevertheless be done, but haven't looked at the codec framework closely yet). I think it would be unfortunate if, in trying to solve a small subset of these problems (as James and Marcin are doing), we were to overlook the possibility of a good solution to a whole bunch of related problems.


