[Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

Guido van Rossum guido at python.org
Wed Feb 15 00:13:33 CET 2006


On 2/14/06, Thomas Wouters <thomas at xs4all.net> wrote:

On Mon, Feb 13, 2006 at 03:44:27PM -0800, Guido van Rossum wrote:

> But adding an encoding doesn't help. The str.encode() method always
> assumes that the string itself is ASCII-encoded, and that's not good
> enough:
>
> >>> "abc".encode("latin-1")
> 'abc'
> >>> "abc".decode("latin-1")
> u'abc'
> >>> "abc\xf0".decode("latin-1")
> u'abc\xf0'
> >>> "abc\xf0".encode("latin-1")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position
> 3: ordinal not in range(128)

(Note that I've since been convinced that bytes(s) where type(s) == str should just return a bytes object containing the same bytes as s, regardless of encoding. So basically you're preaching to the choir now. The only remaining question is what, if anything, to do with an encoding argument when the first argument is of type str...)
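For readers skimming the traceback above: in Python 2.x, calling .encode() with a character codec on a byte string implicitly decodes it with the default codec (normally 'ascii') first, which is where the UnicodeDecodeError comes from. A minimal sketch of that equivalence, assuming the default encoding has not been changed:

    import sys

    s = "abc\xf0"

    # In Python 2.x these two calls behave the same way: encode() on a byte
    # string first decodes it with the default codec, then encodes the result.
    #   s.encode("latin-1")
    #   s.decode(sys.getdefaultencoding()).encode("latin-1")
    # Both fail on the non-ASCII byte \xf0 before latin-1 is ever consulted.
    try:
        s.encode("latin-1")
    except UnicodeDecodeError, err:
        print "implicit ASCII decode failed:", err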

These comments disturb me. I never really understood why (byte) strings grew the 'encode' method, since 8-bit strings are already encoded by their very nature. I understand it's useful because Python supports non-Unicode codecs like 'hex', but I don't really see why that was worth it. The benefits don't seem to outweigh the cost (but that's hindsight).
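For context, 'hex' is one of Python 2's bytes-to-bytes codecs; a quick illustrative session (not part of the original mail):

    >>> "abc".encode("hex")       # bytes -> bytes, no Unicode involved
    '616263'
    >>> "616263".decode("hex")    # and back again
    'abc'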

It may also have something to do with Jython compatibility (which has str and unicode being the same thing) or 3.0 future-proofing.

Directly encoding a (byte) string into a unicode encoding is mostly useless, as you've shown. The only use-case I can think of is translating ASCII into, for instance, EBCDIC. Encoding anything into an ASCII superset is a no-op, unless the system encoding isn't 'ascii' (and that's pretty rare, and not something a Python programmer should depend on). On the other hand, the fact that (byte) strings have an 'encode' method creates a lot of confusion among Unicode newbies, and causes programs to break only when input is non-ASCII. And non-ASCII input just happens too often and too unpredictably in 'real-world' code, and not enough in European programmers' tests ;P
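To make the two cases concrete, here is a sketch of both, assuming the default 'ascii' system encoding (cp500 is one of the EBCDIC codecs shipped with Python):

    >>> "abc".encode("latin-1")   # ASCII superset: a round trip that changes nothing
    'abc'
    >>> "abc".encode("cp500")     # EBCDIC: the bytes actually change
    '\x81\x82\x83'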

Oh, there are lots of ways that non-ASCII input can break code, you don't have to invoke encode() on str objects to get that effect. :/

Unicode objects and strings are not the same thing. We shouldn't treat them as the same thing.

Well in 3.0 they will be the same thing, and in Jython they already are.

They share an interface (like lists and tuples do), and if you only use that interface, treating them as the same kind of object is mostly OK. They actually share less of an interface than lists and tuples, though, as comparing strings to unicode objects can raise an exception, whereas comparing lists to tuples is not expected to.

No, it causes silent surprises since [1,2,3] != (1,2,3).
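To illustrate both failure modes -- the list/tuple case is silent, the str/unicode case blows up (this is Python 2.4 behaviour; later 2.x releases demote the error to a UnicodeWarning, so treat the exact output as version-dependent):

    >>> [1, 2, 3] == (1, 2, 3)    # silently unequal, no error
    False
    >>> u"caf\xe9" == "caf\xe9"   # mixed unicode/str comparison with non-ASCII
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3:
    ordinal not in range(128)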

For anything less trivial than indexing, slicing and most of the string methods, and for anything whatsoever involving non-ASCII (or, rather, non-system-encoding) data, unicode objects and strings must be treated separately. For instance, there is no correct way to do:

s.split("\x80") unless you know the type of 's'. If it's unicode, you want u"\x80" instead of "\x80". If it's not unicode, splitting "\x80" may not even be sensible, but you wouldn't know from looking at the code -- maybe it expects a specific encoding (or encoding family), maybe not. As soon as you deal with unicode, you need to really understand the concept, and too many programmers don't. And it's very hard to tell from someone's comments whether they fail to understand or just get some of the terminology wrong; that's why Guido's comments about 'encoding a byte string' and 'what if the file encoding is Unicode' scare me. The unicode/string mixup almost makes me wish Python was statically typed.

I'm mostly trying to reflect various broken mental models that users may have. Believe me, my own confusion is nothing compared to the confusion that occurs in less gifted users. :-)

The only use case for mixing ASCII and Unicode that I wanted to work right was the mixing of pure ASCII strings (typically literals) with Unicode data. And that works.

Where things unfortunately fall flat is when you start reading data from files or interactive input and it gives you some encoded str object instead of a Unicode object. Our mistake was that we didn't foresee this clearly enough. Perhaps open(filename).read(), when the file contains non-ASCII bytes, should have been changed to either return a Unicode string (if an encoding can somehow be guessed), or raise an exception, rather than returning a str object in some unknown (and usually unknowable) encoding.

I hope to fix that in 3.0 too, BTW.
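The workaround that exists today is codecs.open(), which moves the decode into the read itself; a short sketch (the filename and encoding below are assumptions for the example):

    import codecs

    # Plain open(): whatever bytes are in the file come back as a str object
    # in an unknown encoding.
    raw = open("example.txt").read()

    # codecs.open(): decoding happens at the I/O boundary, so you get unicode
    # out -- or a UnicodeDecodeError if the declared encoding is wrong.
    text = codecs.open("example.txt", encoding="latin-1").read()
    assert isinstance(text, unicode)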

So please, please, please don't make the mistake of 'doing something' with the 'encoding' argument to 'bytes(s, encoding)' when 's' is a (byte) string. It wouldn't actually be usable except for the same things as 'str.encode': to convert from ASCII to non-ASCII-supersets, or to convert to non-unicode encodings (such as 'hex'.) You can achieve those two by doing, e.g., 'bytes(s.encode('hex'))' if you really want to. Ignoring the encoding (rather than raising an exception) would also allow code to be trivially portable between Python 2.x and Py3K, when "" is actually a unicode object.

Not that I'm happy with ignoring anything, but not ignoring it would be the bigger crime here.

I'm beginning to see that this is a pretty reasonable interpretation.

Oh, and while on the subject, I'm not convinced going all-unicode in Py3K is a good idea either, but maybe I should save that discussion for PyCon. I'm not thinking "why do we need unicode" anymore (which I did two years ago ;) but I am thinking it'll be a big step for 90% of the programmers if they have to grasp unicode and encodings to be able to even do 'raw_input()' sensibly. I know I spend an inordinate amount of time trying to explain the basics on #python on irc.freenode.net already.

I'm actually hoping that by having all strings be Unicode we'd reduce the amount of confusion. The key (see above where I admitted this as our biggest Unicode mistake) is to make sure that the encoding/decoding is built into all I/O operations.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
