[Python-Dev] Byte string class hierarchy

Jack Jansen Jack.Jansen at cwi.nl
Thu Aug 19 00:16:33 CEST 2004

Previous message: [Python-Dev] Re: PEP 318: Suggest we drop it
Next message: [Python-Dev] Byte string class hierarchy

I may have missed a crucial bit of the discussion, having been away, so if this is completely beside the point let me know. But my feeling is that the crucial bit is the type inheritance graph of all the byte and string types. And I wonder whether the following graph would help us solve most problems (aside from introducing one new one, which may be a showstopper):

genericbytes
    mutablebytes
    bytes
        genericstring
            string
            unicode

The basic type for all bytes, buffers and strings is genericbytes. This abstract base type is neither mutable nor immutable, and has the interface that all of the types would share. Mutablebytes adds slice assignment and such. Bytes, on the other hand, adds hashing. genericstring is the magic stuff that's there already that makes unicode and string interoperable for hashing and dict keys and such.
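As a rough sketch of how this graph could look as classes (written in present-day Python, with capitalised class names that are purely hypothetical so they don't clash with the built-ins, and with a bytearray as an assumed storage choice):

```python
# Hypothetical sketch of the proposed graph; none of these classes exist in Python.
class GenericBytes:
    """Abstract base: the read-only interface every type would share."""

    def __init__(self, data=b""):
        self._data = bytearray(data)  # private copy of the raw bits

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        return self._data[index]

    # The base promises neither mutation nor hashing.
    __hash__ = None


class MutableBytes(GenericBytes):
    """Adds slice (and item) assignment."""

    def __setitem__(self, index, value):
        self._data[index] = value


class Bytes(GenericBytes):
    """Immutable, so it can add hashing."""

    def __eq__(self, other):
        return isinstance(other, Bytes) and self._data == other._data

    def __hash__(self):
        return hash(bytes(self._data))


class GenericString(Bytes):
    """Stand-in for the existing str/unicode interoperability layer."""


class String(GenericString):
    pass


class Unicode(GenericString):
    pass
```

With these definitions a MutableBytes accepts `m[0:1] = b"x"` but cannot be hashed, while Bytes and everything below it can be used as a dict key.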

Casting to a basetype is always free and doesn't copy anything, i.e. the bits stay the same. 'foo' in source code is a string, and if you cast it to bytes you'll just get the bits, which is pretty much the same as what you get now. If you really want to make sure you get an 8-bit ASCII representation, even if you run in an interpreter built with UCS4 as the default character set, you must use bytes('foo'.encode('ascii')).

Casting to a subtype may cause a copy, but does not modify the bits. Casting sideways copies, and may modify the bits too: this is the current unicode encode/decode stuff. These two rules mean that unicode('foo') is something different from unicode(bytes('foo')), and probably illegal to boot, but I don't think that's too much of a problem: you shouldn't explicitly cast to bytes() unless you really want uninterpreted bits.

Operations like concatenation return the most specialised class. Mutablebytes is the only problem case here; we should probably forbid concatenating these with the others. The alternatives (return mutablebytes, return the other one, return the type of the first operand) all seem somewhat random.
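A minimal sketch of that concatenation rule, again with hypothetical class names. How two sibling types (such as string and unicode) should combine is not covered by the rule as stated, so this sketch simply refuses them; that refusal is an assumption of the example, not part of the proposal.

```python
# Hypothetical stand-ins for the proposed types; only concatenation is modelled.
class GenericBytes:
    def __init__(self, data=b""):
        self.data = bytes(data)

    def __add__(self, other):
        if not isinstance(other, GenericBytes):
            return NotImplemented
        left, right = type(self), type(other)
        # Mutable buffers may only be concatenated with other mutable buffers.
        if issubclass(left, MutableBytes) != issubclass(right, MutableBytes):
            raise TypeError("cannot concatenate mutablebytes with other byte types")
        # Otherwise the result takes the more specialised of the two classes.
        if issubclass(left, right):
            result_type = left
        elif issubclass(right, left):
            result_type = right
        else:
            # Assumption of this sketch: unrelated siblings are rejected.
            raise TypeError("no most-specialised class for these operands")
        return result_type(self.data + other.data)


class MutableBytes(GenericBytes):
    pass


class Bytes(GenericBytes):
    pass


class String(Bytes):
    pass


combined = Bytes(b"ab") + String(b"cd")
print(type(combined).__name__, combined.data)
```

Here Bytes + String and String + Bytes both give a String, while MutableBytes + Bytes raises TypeError.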

Read() is guaranteed only to return genericbytes, but if you open a file in text mode it will return strings, and we should add the ability to open files for unicode and probably mutablebytes too. I'm not sure about socket.recv() and such, but something similar probably holds. Readline() really shouldn't be allowed on files open in binary mode, but that may be a bit too much.

Write and friends accept genericbytes, and binary files will just dump the bits. Files open in text mode may convert unicode and string objects between representations.

The bad news (aside from any glaring holes I may have overlooked in the above: shoot away!) is that I don't know what to do for hash on bytes objects. On the one hand I would like hash('foo') == hash(bytes('foo')). But that leads to also wanting hash(u'foo') == hash(bytes(u'foo')), and we can't really have that because hash('foo') == hash(u'foo') is needed to make string/unicode interoperability for dictionaries work. Note that for the value 'foo' this isn't a problem, but for 'föö' (that's F O-UMLAUT O-UMLAUT) it is. So it seems that making hash('foo') != hash(bytes('foo')) is the only reasonable solution (and probably also a good idea with the future in mind: explicit is better than implicit, so just put a cast there if you want the binary bits to be interpreted as an ASCII or Unicode string!), but it will probably break existing code.
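To make the 'foo' versus 'föö' difference concrete, here is a small illustration in present-day Python 3 (not the 2004 interpreter). It does not model the proposed types; the helper name raw_bits and the choice of Latin-1 and UTF-8 as the two encodings are assumptions made for the example.

```python
# Illustration only, written for Python 3; the discussion above concerns Python 2.
ascii_text = "foo"
umlaut_text = "f\u00f6\u00f6"  # F O-UMLAUT O-UMLAUT


def raw_bits(text, encoding):
    """Return the bytes that `text` occupies under `encoding`."""
    return text.encode(encoding)


# Pure ASCII text has the same bits in both 8-bit encodings...
same_for_ascii = raw_bits(ascii_text, "latin-1") == raw_bits(ascii_text, "utf-8")

# ...but non-ASCII text does not, so a hash computed over the raw bits
# cannot agree with a hash computed over the characters in every case.
same_for_umlaut = raw_bits(umlaut_text, "latin-1") == raw_bits(umlaut_text, "utf-8")

print(same_for_ascii, same_for_umlaut)  # prints: True False
```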

Jack Jansen, <Jack.Jansen at cwi.nl>, http://www.cwi.nl/~jack
If I can't dance I don't want to be part of your revolution -- Emma Goldman
