[Python-3000] Unicode and OS strings

Stephen J. Turnbull stephen at xemacs.org
Fri Sep 14 10:56:24 CEST 2007

Greg Ewing writes:

> Stephen J. Turnbull wrote:
>
> > You can't win that, because Unicode is the only encoding that attempts to guarantee even the possibility of round-tripping.
>
> Rubbish -- I can do print [ord(c) for c in my_unicode_string] and get perfect round-trippability if I want.

Speaking of rubbish. You chose the context of round-tripping across encodings, not me. Please stick with your context.
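
To make the two senses of "round-tripping" concrete, here is a minimal Python 3 sketch. The sample bytes and the choice of Latin-1 and KOI8-R are illustrative assumptions, not taken from the thread. Dumping code points trivially round-trips inside Unicode; what is at issue is whether data survives a trip across encodings.

```python
# Sample data, assumed for illustration: "café" encoded as Latin-1.
legacy_bytes = b"caf\xe9"
text = legacy_bytes.decode("latin-1")

# Sense 1: dump the code points and rebuild the string.
# This never leaves Unicode, so it always succeeds.
code_points = [ord(c) for c in text]
rebuilt = "".join(chr(n) for n in code_points)

# Sense 2: carry the text across encodings.
def survives(s, encoding):
    """Return True if s can be encoded to `encoding` and decoded back unchanged."""
    try:
        return s.encode(encoding).decode(encoding) == s
    except UnicodeError:
        return False

via_utf8 = survives(text, "utf-8")     # Unicode encoding form: lossless
via_latin1 = survives(text, "latin-1") # back to the source encoding: lossless
via_koi8r = survives(text, "koi8_r")   # another legacy encoding: no slot for U+00E9
```

Only the last case fails, which is the situation the context of the argument is about.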

> You can ask people to use pre-existing officially-sanctioned encodings for their unicode data, but you can't force them to.

A wide variety of encodings, some standard and some not, and not necessarily with a known injection into Unicode, is precisely what I'm trying to deal with. None of the other proposals, except maybe Martin's, do. James Knight's proposal as it stands assumes UTF-8 Unicode, while Marcin Kowalczyk's just punts to treating everything unknown as a sequence of code units AFAICS.

> > The main problem with this scheme that I know of is that if you have a Python string that contains such a code point, you'll need to somehow include the information about the original encoding when pickling and the like.

I was merely admitting that getting it to work efficiently and backward-compatibly for pickling will be tricky. But it's trivial to get it to work reliably.
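
As a rough illustration of "reliable but not efficient", the following Python 3 sketch pickles the decoded text together with the original bytes and the encoding that was assumed. The record layout and the name `make_record` are hypothetical, invented for this example; none of the proposals in the thread specifies such a format.

```python
import pickle

def make_record(raw, assumed_encoding):
    """Hypothetical record: best-effort text plus what is needed to recover `raw`."""
    try:
        text = raw.decode(assumed_encoding)
        exact = True
    except UnicodeDecodeError:
        # Keep a readable approximation; the raw bytes remain authoritative.
        text = raw.decode(assumed_encoding, errors="replace")
        exact = False
    return {"text": text, "raw": raw, "encoding": assumed_encoding, "exact": exact}

# Latin-1 bytes mislabelled as UTF-8, so strict decoding fails.
record = make_record(b"caf\xe9", "utf-8")

# Pickling a plain dict of str/bytes/bool works with every pickle protocol.
restored = pickle.loads(pickle.dumps(record))
```

Nothing is lost across the pickle, but every such string is stored twice, and older unpicklers would see a dict rather than a string. That is the efficiency and backward-compatibility cost.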

> That's exactly the sort of thing I'm talking about. It would be surprising if pickling worked reliably for all strings except ones that happened to come in as a command line argument.

Um, no, it's not what you're talking about. Pickling is not currently reliable for strings that come in as command line arguments because Python is not reliable. That's precisely what we're trying to fix. None of the proposals make things worse, since they only apply in cases where the codec would throw an exception or incorrectly decode the argument anyway.

Yes, you could improve reliability in this sense by storing those strings as bytes, rather than trying to make better encoding guesses and storing "debugging info" about undecodable input. But surely using bytes objects is a non-starter; users are going to expect that command-line arguments are strings, not bytes, and ASCII-only users will raise hell if you ask them to explicitly invoke codecs to translate command-line arguments to strings so that they can be used.
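
A small Python 3 sketch of the trade-off, using a made-up argument list (`raw_args` is hypothetical and stands in for whatever the OS hands the interpreter). Strict decoding fails on the one awkward argument, while keeping everything as bytes forces even the ASCII-only case through an explicit codec call.

```python
# Hypothetical command-line arguments as the OS delivers them: raw bytes.
# The second one is Latin-1, which is not valid UTF-8.
raw_args = [b"report.txt", b"r\xe9sum\xe9.txt"]

def decode_strict(args, encoding="utf-8"):
    """Decode every argument, raising on the first undecodable one."""
    return [a.decode(encoding) for a in args]

# Option A: strict decoding. One bad argument breaks the whole list.
try:
    decoded = decode_strict(raw_args)
    strict_failed = False
except UnicodeDecodeError:
    decoded = None
    strict_failed = True

# Option B: keep bytes. Nothing is lost, but bytes and str do not mix,
# so even a pure-ASCII name needs an explicit decode before use as text.
first = raw_args[0]
is_text = isinstance(first, str)
as_text = first.decode("ascii")
```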
