[Python-Dev] methods on the bytes object

Josiah Carlson jcarlson at uci.edu
Mon May 1 06:19:04 CEST 2006


"Martin v. Löwis" <martin at v.loewis.de> wrote:

Josiah Carlson wrote:
>> I think what you are missing is that algorithms that currently operate
>> on byte strings should be reformulated to operate on character strings,
>> not reformulated to operate on bytes objects.
>
> By "character strings" can I assume you mean unicode strings which
> contain data, and not some new "character string" type?

I mean unicode strings, period. I can't imagine what "unicode strings
which do not contain data" could be.

Binary data as opposed to text. Input to array.fromstring(), struct.unpack(), etc.

> I know I must
> have missed some conversation. I was under the impression that in Py3k:
>
> Python 1.x and 2.x str -> mutable bytes object

No. Python 1.x and 2.x str -> str, Python 2.x unicode -> str

In addition, a bytes type is added, so that
Python 1.x and 2.x str -> bytes

The problem is that the current string type is used both to represent bytes and characters. Current applications of str need to be studied, and converted appropriately, depending on whether they use "str-as-bytes" or "str-as-characters". The "default", in some sense of that word, is that str applications are assumed to operate on character strings; this is achieved by making string literals objects of the character string type.

Certainly it is the case that right now strings are used to contain 'text' and 'bytes' (binary data, encodings of text, etc.). The problem is in the ambiguity of Python 2.x str containing text where it should only contain bytes. But in 3.x, there will continue to be an ambiguity, as strings will still contain bytes and text (parsing literals, see the somewhat recent argument over bytes.encode('base64'), etc.). We've not removed the problem, only changed it from being contained in non-unicode strings to be contained in unicode strings (which are 2 or 4 times larger than their non-unicode counterparts).
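To put a number on the "2 or 4 times larger" remark, here is a minimal sketch. It assumes that fixed-width encodings (utf-16-le and utf-32-le, without a byte order mark) are a fair approximation of how narrow and wide unicode builds of Python 2.x stored characters; it does not measure actual interpreter memory use.

```python
# Illustrative sample text; any ASCII-only string behaves the same way.
text = "bytes object"

narrow = text.encode("latin-1")    # 1 byte per character, like a 2.x str
ucs2 = text.encode("utf-16-le")    # 2 bytes per character, no BOM
ucs4 = text.encode("utf-32-le")    # 4 bytes per character, no BOM

# Payload sizes in bytes for the three representations.
sizes = (len(narrow), len(ucs2), len(ucs4))
print(sizes)
```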

Within the remainder of this email, there are two things I'm trying to accomplish:

  1. preserve the Python 2.x string type
  2. make the bytes object more palatable regardless of #1

The current plan (from what I understand) is to make all string literals equivalent to their Python 2.x u-prefixed equivalents, and to leave u-prefixed literals alone (unless the u prefix is being removed?). I won't argue because I think it is a great idea.

I do, however, believe that the Python 2.x string type is very useful from a data parsing/processing perspective. Look how successful and effective it has been so far in the history of Python. In order to make the bytes object as effective in 3.x, one would need to add basically all of the Python 2.x string methods to it. Some mechanism for using slices of bytes objects as dictionary keys would also be nice (if data[:4] in handler: ... -> if tuple(data[:4]) in handler: ... ?). Of course, these implementations, ultimately, already exist with Python 2.x immutable strings.
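The dictionary-key concern can be sketched as follows. This is only an illustration: bytearray stands in for the proposed mutable bytes object (an assumption, since that type did not exist when this was written), and the handler functions and tags are hypothetical.

```python
# Hypothetical handlers keyed by a 4-byte tag at the start of the data.
def handle_riff(payload):
    return ("riff", len(payload))

def handle_png(payload):
    return ("png", len(payload))

# bytearray is a stand-in for the proposed mutable bytes object.
data = bytearray(b"RIFF\x00\x00\x00\x10rest-of-file")

# A slice of a mutable buffer is itself mutable, hence unhashable,
# so it cannot be used directly as a dictionary key.
try:
    hash(data[:4])
    mutable_slice_hashable = True
except TypeError:
    mutable_slice_hashable = False

# Workaround mentioned in the text: key on a tuple of integers.
tuple_handlers = {tuple(b"RIFF"): handle_riff, tuple(b"\x89PNG"): handle_png}
via_tuple = tuple_handlers[tuple(data[:4])](data[4:])

# With an immutable byte string the slice works as a key as-is.
frozen = bytes(data)
bytes_handlers = {b"RIFF": handle_riff, b"\x89PNG": handle_png}
via_bytes = bytes_handlers[frozen[:4]](frozen[4:])
```

The immutable variant keeps the `if data[:4] in handler:` idiom unchanged, which is the convenience being argued for.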

So, what to do? Rename Python 2.x str to bytes. The name of the type now confers the idea that it should contain bytes, not strings. If bytes literals are deemed necessary (I think they would be nice, but not required), have b"..." as the bytes literal. Not having a literal, I think, will generally reduce the number of people who try to put text into bytes.

Ahh, but what about the originally thought-about bytes object? That mutable, file-like, string-like thing which is essentially array.array('B', ...) with some other useful stuff? Those are certainly still useful, but not so much from a data parsing/processing perspective as for use as a mutable in-memory buffer (not the Python built-in buffer object, but a C-equivalent char* = (char*)malloc(...); ). I currently use mmaps and array objects for that (with limited success), but a new type in the collections module (perhaps mutablebytes?) which offers such functionality would be perfectly reasonable (as would moving the immutable bytes object if it lacked a literal, or even switching to bytes/frozenbytes).

If we were to go to the mutable/immutable bytes object pair, we could still give mutable bytes .read()/.write(), slice assignment, etc., and even offer an integer view mechanism (for iteration, assignment, etc.). Heck, we could do the same thing for the immutable type (except for .write(), assignment, etc.), and essentially replace cStringIO(initializer) (of course mutable bytes effectively replace cStringIO()).
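A rough sketch of the operations listed above, assuming present-day Python 3 types as stand-ins: bytearray for the mutable bytes object and io.BytesIO for the file-like .read()/.write() behaviour. Neither is the type proposed here; they only show what the combination could look like.

```python
import io

# bytearray stands in for the proposed mutable bytes object.
buf = bytearray(b"hello world")

# Slice assignment on the mutable buffer.
buf[0:5] = b"HELLO"

# Integer view: iteration and indexing yield integers in range(256),
# and single items can be assigned as integers.
first_three = list(buf[:3])
buf[-1] = ord("D")

# File-like access over in-memory bytes, the role cStringIO played
# in Python 2.x.
stream = io.BytesIO()
stream.write(bytes(buf))
stream.seek(0)
head = stream.read(5)
rest = stream.read()
```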

> and that there would be some magical argument
> to pass to the file or open open(fn, 'rb', magicalparameter).read() ->
> bytes.

I think the precise details of that are still unclear. But yes, the plan is to have two file modes: one that returns character strings (type 'str') and one that returns type 'bytes'.

Here's a thought: require 'b' or 't' as arguments to open/file, the 't' also having an optional encoding argument (which defaults to the current default encoding). If one attempts to write bytes to a text file, or text to a bytes file, raise IOError with "Cannot write bytes to a text file" or "Cannot write text to a bytes file" respectively. Passing an encoding to the 'b' file could either raise an exception, or provide an encoding for text writing (removing the "Cannot write text to a bytes file" error), though I wouldn't want to do any encoding by default for this case.
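The 'b'/'t' idea can be sketched with a small wrapper. Everything here is hypothetical: StrictFile is an invented name, not an existing or proposed API; utf-8 is assumed as the default encoding; and of the two options for an encoding passed to a 'b' file, this sketch arbitrarily picks "raise an exception".

```python
import io

class StrictFile:
    """Hypothetical wrapper enforcing the text/bytes split on writes."""

    def __init__(self, raw, kind, encoding=None):
        if kind not in ("b", "t"):
            raise ValueError("kind must be 'b' or 't'")
        if kind == "b" and encoding is not None:
            # One of the two options from the text; chosen arbitrarily.
            raise ValueError("encoding is not accepted for a bytes file")
        self.raw = raw
        self.kind = kind
        self.encoding = encoding or "utf-8"  # assumed default encoding

    def write(self, data):
        if self.kind == "t":
            if not isinstance(data, str):
                raise IOError("Cannot write bytes to a text file")
            return self.raw.write(data.encode(self.encoding))
        if isinstance(data, str):
            raise IOError("Cannot write text to a bytes file")
        if not isinstance(data, (bytes, bytearray, memoryview)):
            raise TypeError("expected a bytes-like object")
        # Writes accept either the immutable or the mutable flavour.
        return self.raw.write(bytes(data))

# Usage: text goes through the encoding, bytes pass through untouched.
text_sink = io.BytesIO()
text_file = StrictFile(text_sink, "t", encoding="utf-8")
text_file.write("caf\u00e9")

byte_sink = io.BytesIO()
byte_file = StrictFile(byte_sink, "b")
byte_file.write(bytearray(b"\x00\x01"))
```

Mixing the two kinds fails loudly at the write call rather than producing mis-encoded output later.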

If there are mutable/immutable bytes as I describe above, reads on such files could produce either, but only one of the two (immutable seems reasonable, at least from a consistency perspective), while writes could take either (or even buffer()s).

> I mention this because I do binary data handling, some ''.join(...) for
> IO buffers as Guido mentioned (because it is the fastest string
> concatenation available in Python 2.x), and from this particular
> conversation, it seems as though Python 3.x is going to lose
> some expressiveness and power.

You certainly need a "concatenate list of bytes into a single bytes". Apparently, Guido assumes that this can be done through bytes().join(...); I personally feel that this is over-generalization: if the only practical application of .join is the empty bytes object as separator, I think the method should be omitted.

bytes(...)
bytes.join(...)

I don't know if the only use-case for bytes would be ''.join() (all of mine happen to be; non-''.join() cases are text), but I don't see the motivator for only allowing that particular use. The difference is an increment in the implementation; type checking and data copying should be more significant.
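For comparison, a short sketch of both uses of join on byte strings. It assumes present-day Python 3, where bytes.join exists and takes a separator like str.join does; the sample data is invented.

```python
# Empty separator: assembling an I/O buffer from pieces, the common case.
pieces = [b"\x00\x01", b"\x02", b"\x03\x04"]
buffer_contents = b"".join(pieces)

# Non-empty separator: same method, no extra interface needed.
fields = [b"GET", b"/index.html", b"HTTP/1.0"]
request_line = b" ".join(fields)

# Type checking: text mixed into the sequence is rejected.
try:
    b"".join([b"abc", "def"])
    mixed_ok = True
except TypeError:
    mixed_ok = False
```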


