[Python-Dev] bytes type discussion

Thomas Wouters thomas at xs4all.net
Wed Feb 15 01:24:46 CET 2006

On Tue, Feb 14, 2006 at 03:13:25PM -0800, Guido van Rossum wrote:

Martin von Loewis's alternative for the "very controversial" set is to disallow an encoding argument and (I believe) also to disallow Unicode arguments. In 3.0 this would leave us with s.encode() as the only way to convert a string (which is always unicode) to bytes. The problem with this is that there's no code that works in both 2.x and 3.0.

Unless you only ever create (byte)strings by doing s.encode(), and only send them to code that is either byte/string-agnostic or -aware. Oh, and don't use indexing, only slicing (length-1 slices if you have to). I guess it depends on how much code will accept a bytes-string where currently a string is the norm (and a unicode object is default-encoded).
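To make the indexing-versus-slicing point concrete, here is a minimal sketch. It assumes the bytes semantics that Python 3 eventually shipped, which were not settled at the time of this thread; the variable names are purely illustrative.

```python
# Sketch only: assumes Python 3 bytes semantics, which postdate this thread.
text = "Hi!"
data = text.encode("utf-8")   # the explicit string -> bytes conversion

# Indexing is where 2.x and 3.x disagree: 3.x gives an int,
# 2.x gives a length-1 string.
first_by_index = data[0]

# Slicing gives a length-1 byte string either way, so it is the portable form.
first_by_slice = data[0:1]

print(first_by_index, first_by_slice)
```

Code that sticks to the slice form never sees the integer, which is why slicing is the safer habit for anything meant to run on both.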

I'm still worried that all this is quite a big leap. Very few people understand the intricacies of unicode encodings. (Almost everyone understands unicode, except they don't know it yet; it's the encodings that are the problem.) By forcing everything to be unicode without a uniform encoding-detection scheme, we're forcing every programmer who opens a file or reads from the network to think about encodings. This will be a pretty big step for newbie programmers.

And it's not just that. The encoding of network streams or files may be entirely unknown beforehand, and depend on the content: a content-encoding, an HTML tag. Will bytes-strings get string methods for easy searching of content descriptors? Will the 're' module accept bytes-strings? What would the literals you want to search for look like? Do I really do 'if bytes("Content-Type:") in data:' and such? Should data perhaps get read using the opentext() equivalent of 'decode('ascii', 'replace')' and then parsed the 'normal' way? What about data gotten from an extension? And never mind what the 'right way' for that is; what will programmers do? The 'right way' often escapes them.
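For illustration, the questions above can be tried out against the bytes type as Python 3 later defined it. This is a hedged sketch, not an answer anyone gave in the thread: the sample data and all variable names are made up, and the b"..." and br"..." literals are the later Python 3 spelling.

```python
import re

# Hypothetical raw data as it might arrive from a socket; not from the thread.
data = b"Content-Type: text/html; charset=iso-8859-1\r\n\r\n\xa1Python!"

# Searching for a content descriptor with a bytes literal.
has_header = b"Content-Type:" in data

# The 're' module accepts a bytes pattern when the subject is bytes.
match = re.search(br"charset=([\w-]+)", data)
charset = match.group(1).decode("ascii") if match else "ascii"

# The lossy route: decode everything as ASCII and replace what does not fit.
lossy = data.decode("ascii", "replace")

# The careful route: split off the body and decode it with the declared charset.
header, _, body = data.partition(b"\r\n\r\n")
body_text = body.decode(charset)
```

The lossy route turns the \xa1 byte into a replacement character, while the careful route recovers the inverted exclamation mark, which is the trade-off the paragraph is worried about.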

It may well be that I'm thinking too conservatively, too stuck in the old ways, but I think we're being too hasty in dismissing the ol' string. Don't get me wrong, I really like the idea of as much of Python doing unicode as possible, and the idea of a mutable bytes type sounds good to me too. I just don't like the wide gap between the troublesome-to-get unicode object and the unreadable-repr, weird-indexing, hard-to-work-with bytes-string. I don't think adding something in between is going to work (we basically have that now, the normal string), so I suggest the bytes-string becomes a bit more 'string' and a bit less 'sequence of bytes'. Perhaps in the form of:

- A bytes type that repr()'s to something readable
- A way to write byte literals that doesn't bleed the eyes, and isn't so fragile in the face of source-encoding (all the suggestions so far have you explicitly re-stating the source-encoding at each bytes("".encode())). If you have to wonder why that's fragile, just think about a recoding editor. Alternatively, get a short way to say 'encode in source-encoding'

(I can't think of anything better than b"..." for the above two... Except... hmm... didn't `` become available in Py3k? Too little visual distinction?)

- A way to manipulate the bytes as character-strings. Pattern matching, splitting, finding, slicing, etc. Quite like current strings.
- Disallowing any interaction between bytes and real (meaning 'unicode') strings. Not "oh, let's assume ascii or the default encoding", either. If the user wants to explicitly decode using 'ascii', that's their choice, but they should consciously make it.
- Mutable or immutable, I don't know. I fear that if the bytes type was easy enough to handle and mutable, and the normal (unicode) strings were immutable, people may end up using bytes all the time. In fact, they may do that anyway; I'm sure Python will grow entire subcults that prefer doing 'string("\xa1Python!")' where 'string' is 'bytes(arg.encode("iso-8859-1"))'.
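As a rough sketch of what the wishlist above looks like in practice, the snippet below uses the bytes and bytearray types as Python 3 later defined them. That is an assumption relative to this thread, where none of it had been decided, and the names are illustrative only.

```python
# Sketch only: relies on Python 3 behaviour that postdates this thread.
data = b"\xa1Python!"            # a byte literal, independent of source-encoding

readable = repr(data)            # a repr that shows printable ASCII as text
parts = b"a,b,c".split(b",")     # string-like manipulation on bytes

# Mixing bytes with a (unicode) string is refused rather than guessed at.
try:
    data + "!"
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decoding requires the programmer to name an encoding consciously.
text = data.decode("iso-8859-1")

# bytearray is the mutable counterpart; bytes itself stays immutable.
buf = bytearray(data)
buf[0:1] = b"!"
```

Keeping the immutable and mutable variants as two separate types is one way to answer the "mutable or immutable" question without making the mutable one the path of least resistance.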

Bytes should be easy enough to manipulate 'as strings' to do the basic tasks, but not easy enough to encourage people to forget about that whole annoying 'encoding' business and just use them instead (which is basically what we have now.) On the other hand, if people don't want to deal with that whole encoding business, we should allow them to -- consciously. We can offer a variety of hints and tips on how to figure out the encoding of something, but we can't do the thinking for them (trust me, I've tried.)

When a file's encoding is specified in file metadata, that's great, really great. When a network connection is handled by a library that knows how to deal with the content (*cough*Twisted*cough*) and can decode it for you, that's really great too. But we're not there yet, not by a long shot. And explaining encodings to an ADHD-infested teenager high on adrenalin and creative inspiration who just wants to connect to an IRC server to make his bot say "Hi!", well, that's hard. I'd rather they don't go and do PHP instead. Doing it right is hard, but it's even harder to do it all right the first time, and Python never really worried about that ;P

-- Thomas Wouters <thomas at xs4all.net>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
