[Python-3000] PEP 3137: Immutable Bytes and Mutable Buffer

Guido van Rossum guido at python.org
Wed Sep 26 23:58:53 CEST 2007


Please comment.

PEP: 3137
Title: Immutable Bytes and Mutable Buffer
Version: $Revision: 58264 $
Last-Modified: $Date: 2007-09-26 14:58:29 -0700 (Wed, 26 Sep 2007) $
Author: Guido van Rossum <guido at python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 26-Sep-2007
Python-Version: 3.0
Post-History: 26-Sep-2007

Introduction

After releasing Python 3.0a1 with a mutable bytes type, pressure mounted to add a way to represent immutable bytes. Gregory P. Smith proposed a patch that would allow making a bytes object temporarily immutable by requesting that the data be locked using the new buffer API from PEP 3118. This did not seem the right approach to me.

Jeffrey Yasskin, with the help of Adam Hupp, then prepared a patch to make the bytes type immutable (by crudely removing all mutating APIs) and fix the fall-out in the test suite. This showed that there aren't all that many places that depend on the mutability of bytes, with the exception of code that builds up a return value from small pieces.

Thinking through the consequences, and noticing that using the array module as an ersatz mutable bytes type is far from ideal, and recalling a proposal put forward earlier by Talin, I floated the suggestion to have both a mutable and an immutable bytes type. (This had been brought up before, but until seeing the evidence of Jeffrey's patch I wasn't open to the suggestion.)

Moreover, a possible implementation strategy became clear: use the old PyString implementation, stripped down to remove locale support and implicit conversions to/from Unicode, for the immutable bytes type, and keep the new PyBytes implementation as the mutable bytes type.

The ensuing discussion made it clear that the idea is welcome but needs to be specified more precisely. Hence this PEP.

Advantages

One advantage of having an immutable bytes type is that code objects can use these. It also makes it possible to efficiently create hash tables using bytes for keys; this may be useful when parsing protocols like HTTP or SMTP which are based on bytes representing text.
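A minimal sketch of the hash-table point. It assumes a current Python 3 interpreter, where the mutable type this PEP calls buffer is available under the name bytearray; the header names and values are hypothetical.

```python
# Assumption: bytearray stands in for the mutable type this PEP calls buffer.
# The header names and values below are made up for illustration.
headers = {b"Content-Type": b"text/html", b"Host": b"example.org"}

# Immutable bytes are hashable, so they work directly as dict keys.
content_type = headers[b"Content-Type"]

# A mutable byte sequence is not hashable and cannot be used as a key.
try:
    {bytearray(b"Host"): b"example.org"}
    mutable_key_ok = True
except TypeError:
    mutable_key_ok = False
```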

Porting code that manipulates binary data (or encoded text) in Python 2.x will be easier using the new design than using the original 3.0 design with mutable bytes; simply replace str with bytes and change '...' literals into b'...' literals.

Naming

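I propose the following type names at the Python level:

- bytes: the immutable bytes type (the stripped-down PyString implementation)
- buffer: the mutable bytes type (the PyBytes implementation)
- memoryview: the view type introduced by PEP 3118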

The old type named buffer is so similar to the new type memoryview, introduced by PEP 3118, that it is redundant. The rest of this PEP doesn't discuss the functionality of memoryview; it is just mentioned here to justify getting rid of the old buffer type so we can reuse its name for the mutable bytes type.

While eventually it makes sense to change the C API names, this PEP maintains the old C API names, which should be familiar to all.

Literal Notations

The b'...' notation introduced in Python 3.0a1 returns an immutable bytes object, whatever variation is used. To create a mutable bytes buffer object, use buffer(b'...') or buffer([...]). The latter may use a list of integers in range(256).
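As a rough illustration of the two spellings, assuming a current Python 3 interpreter in which bytearray plays the role of the buffer type proposed here:

```python
# Assumption: bytearray stands in for the mutable type this PEP calls buffer.
immutable = b"abc"                     # a b'...' literal is always immutable bytes
from_bytes = bytearray(b"abc")         # corresponds to buffer(b'...')
from_ints = bytearray([97, 98, 99])    # corresponds to buffer([...])

# Integers outside range(256) are rejected.
try:
    bytearray([256])
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False
```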

Functionality

PEP 3118 Buffer API

Both bytes and buffer support the PEP 3118 buffer API. The bytes type only supports read-only requests; the buffer type allows writable and data-locked requests as well. The element data type is always 'B' (i.e. unsigned byte).
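The read-only versus writable distinction can be observed through memoryview. This is only a sketch: it assumes a current Python 3 interpreter, uses bytearray as a stand-in for the buffer type, and does not show data-locked requests.

```python
# Assumption: bytearray stands in for the mutable type this PEP calls buffer.
backing = bytearray(b"abc")
ro_view = memoryview(b"abc")      # view on immutable bytes
rw_view = memoryview(backing)     # view on the mutable type

# Both expose unsigned bytes; only the mutable one grants write access.
formats = (ro_view.format, rw_view.format)
readonly_flags = (ro_view.readonly, rw_view.readonly)

rw_view[0] = ord("x")             # writes through to the backing object

try:
    ro_view[0] = ord("x")
    write_to_bytes_ok = True
except TypeError:
    write_to_bytes_ok = False
```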

Constructors

There are four forms of constructors, applicable to both bytes and buffer:

Comparisons

The bytes and buffer types are comparable with each other and orderable, so that e.g. b'abc' == buffer(b'abc') < b'abd'.

Comparing either type to a str object raises an exception. This turned out to be necessary to catch common mistakes.
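A small sketch of these comparisons, assuming a current Python 3 interpreter with bytearray standing in for buffer. Released interpreters return False for a bytes/str equality test unless run with the -bb option, so the sketch uses an ordering comparison to show the exception.

```python
# Assumption: bytearray stands in for the mutable type this PEP calls buffer.
same = b"abc" == bytearray(b"abc")
chained = b"abc" == bytearray(b"abc") < b"abd"

# Ordering a bytes object against a str object raises TypeError.
try:
    b"abc" < "abc"
    ordering_with_str_ok = True
except TypeError:
    ordering_with_str_ok = False
```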

Slicing

Slicing a bytes object returns a bytes object. Slicing a buffer object returns a buffer object.

Slice assignment to a mutable buffer object accepts anything that supports the PEP 3118 buffer API, or an iterable of integers in range(256).
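The slicing rules might look like this in practice, assuming a current Python 3 interpreter with bytearray in place of the proposed buffer type:

```python
# Assumption: bytearray stands in for the mutable type this PEP calls buffer.
data = b"hello world"
buf = bytearray(data)

# A slice has the same type as the object it was taken from.
slice_types = (type(data[:5]), type(buf[:5]))

# Slice assignment accepts buffer-API objects or iterables of small ints.
buf[0:5] = b"HELLO"
buf[6:] = [87, 79, 82, 76, 68]    # the ASCII codes for "WORLD"
after_same_length = bytes(buf)

# The replacement may have a different length than the slice it replaces.
buf[0:5] = memoryview(b"bye")
after_resize = bytes(buf)
```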

Indexing

Open Issue: I'm undecided on whether indexing bytes and buffer objects should return small ints (like the bytes type in 3.0a1, and like lists or array.array('B')), or bytes/buffer objects of length 1 (like the str type). The latter (str-like) approach will ease porting code from Python 2.x; but it makes it harder to extract values from a bytes array.

Assignment to an item of a mutable buffer object accepts an int in range(256); if we choose the str-like approach for indexing above, it also accepts an object implementing the PEP 3118 buffer API, if it has length 1.
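For comparison, here is how the int-returning option behaves. The sketch assumes a current Python 3 interpreter, where indexing returns small ints and bytearray stands in for buffer; it does not settle the open issue above.

```python
# Assumption: bytearray stands in for the mutable type this PEP calls buffer.
data = b"abc"
first = data[0]             # a small int under the int-returning option
first_as_bytes = data[0:1]  # a length-1 slice gives the str-like result

buf = bytearray(data)
buf[0] = 65                 # item assignment takes an int in range(256)

try:
    buf[1] = 256
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False
```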

Str() and Repr()

The str() and repr() functions return the same thing for these objects. The repr() of a bytes object returns a b'...' style literal. The repr() of a buffer returns a string of the form "buffer(b'...')".
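A quick sketch of these results, assuming a current Python 3 interpreter run without the -b option. There the mutable type is named bytearray, so its repr() shows that name rather than buffer.

```python
# Assumption: bytearray stands in for the mutable type this PEP calls buffer.
raw = b"abc\n"

bytes_repr = repr(raw)
bytes_str = str(raw)                 # no encoding given: same as repr()
mutable_repr = repr(bytearray(b"abc"))
```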

Methods

The following methods are supported by bytes as well as buffer, with similar semantics. They accept anything that implements the PEP 3118 buffer API for bytes arguments, and return the same type as the object whose method is called ("self")::

.capitalize(), .center(), .count(), .decode(), .endswith(), .expandtabs(), .find(), .index(), .isalnum(), .isalpha(), .isdigit(), .islower(), .isspace(), .istitle(), .isupper(), .join(), .ljust(), .lower(), .lstrip(), .partition(), .replace(), .rfind(), .rindex(), .rjust(), .rpartition(), .rsplit(), .rstrip(), .split(), .splitlines(), .startswith(), .strip(), .swapcase(), .title(), .translate(), .upper(), .zfill()

This is exactly the set of methods present on the str type in Python 2.x, with the exclusion of .encode(). The signatures and semantics are the same too. However, wherever a method depends on a character class such as letter, whitespace, or lower case, the ASCII definition of that class is used. (The Python 2.x str type uses the definitions from the current locale, settable through the locale module.) The .encode() method is left out because of the stricter definitions of encoding and decoding in Python 3000: encoding always takes a Unicode string and returns a bytes sequence, and decoding always takes a bytes sequence and returns a Unicode string.
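The ASCII-only rule and the return-type rule can be sketched as follows, assuming a current Python 3 interpreter with bytearray standing in for buffer. The sample word is the Latin-1 encoding of a string with one non-ASCII letter.

```python
# Assumption: bytearray stands in for the mutable type this PEP calls buffer.
word = b"caf\xe9"                    # 0xE9 is not an ASCII letter

upper_word = word.upper()            # only the ASCII letters change case
is_alpha = word.isalpha()            # False: 0xE9 is outside the ASCII classes

# Methods return the type of the object they were called on.
capitalized = bytearray(b"hello").capitalize()

# Bytes arguments may be any object that supports the buffer API.
starts = b"hello".startswith(bytearray(b"he"))
joined = b", ".join([b"a", bytearray(b"b")])

# .decode() exists, .encode() does not.
method_presence = (hasattr(b"", "decode"), hasattr(b"", "encode"))
```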

Bytes and the Str Type

Like the bytes type in Python 3.0a1, and unlike the relationship between str and unicode in Python 2.x, any attempt to mix bytes (or buffer) objects and str objects without specifying an encoding will raise a TypeError exception. This is the case even for simply comparing a bytes or buffer object to a str object (even violating the general rule that comparing objects of different types for equality should just return False).
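A minimal sketch of the no-mixing rule for concatenation, assuming a current Python 3 interpreter with bytearray in place of buffer. The helper raises_type_error is hypothetical and exists only for this illustration.

```python
# Assumption: bytearray stands in for the mutable type this PEP calls buffer.
def raises_type_error(operation):
    """Return True if calling operation() raises TypeError (hypothetical helper)."""
    try:
        operation()
    except TypeError:
        return True
    return False

bytes_plus_str = raises_type_error(lambda: b"abc" + "def")
str_plus_bytes = raises_type_error(lambda: "abc" + b"def")
mutable_plus_str = raises_type_error(lambda: bytearray(b"abc") + "def")

# Mixing works once the str side is encoded explicitly.
explicit = b"abc" + "def".encode("ascii")
```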

Conversions between bytes or buffer objects and str objects must always be explicit, using an encoding. Each direction has two equivalent spellings: str(b, <encoding>[, <errors>]) is equivalent to b.decode(<encoding>[, <errors>]), and bytes(s, <encoding>[, <errors>]) is equivalent to s.encode(<encoding>[, <errors>]).
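The two spellings for each direction can be checked against each other. This sketch assumes a current Python 3 interpreter: decoding goes from bytes to str, and encoding goes from str to bytes.

```python
text = "caf\u00e9"
raw = text.encode("utf-8")

# bytes -> str: the constructor form and the method form agree.
decoded_ctor = str(raw, "utf-8")
decoded_method = raw.decode("utf-8")

# str -> bytes: likewise.
encoded_ctor = bytes(text, "utf-8")
encoded_method = text.encode("utf-8")

# The optional errors argument works the same way in both spellings.
lenient_decode = (str(b"\xff", "ascii", "replace"),
                  b"\xff".decode("ascii", "replace"))
lenient_encode = (bytes("\u00e9", "ascii", "replace"),
                  "\u00e9".encode("ascii", "replace"))
```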

There is one exception: we can convert from bytes (or buffer) to str without specifying an encoding by writing str(b). This produces the same result as repr(b). This exception is necessary because of the general promise that any object can be printed, and printing is just a special case of conversion to str. There is however no promise that printing a bytes object interprets the individual bytes as characters (unlike in Python 2.x).

The str type currently supports the PEP 3118 buffer API. While this is perhaps occasionally convenient, it is also potentially confusing, because the bytes accessed via the buffer API represent a platform-dependent encoding: depending on the platform byte order and a compile-time configuration option, the encoding could be UTF-16-BE, UTF-16-LE, UTF-32-BE, or UTF-32-LE. Worse, a different implementation of the str type might completely change the bytes representation, e.g. to UTF-8, or even make it impossible to access the data as a contiguous array of bytes at all. Therefore, support for the PEP 3118 buffer API will be removed from the str type.
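To illustrate why the raw bytes of a str are ambiguous, the sketch below encodes one character under several of the encodings named above. It assumes a current Python 3 interpreter, where requesting a buffer from a str fails.

```python
# The same one-character string has different byte layouts per encoding.
layouts = {
    "utf-16-le": "A".encode("utf-16-le"),
    "utf-16-be": "A".encode("utf-16-be"),
    "utf-32-le": "A".encode("utf-32-le"),
    "utf-8": "A".encode("utf-8"),
}

# A str object does not expose the buffer API, so memoryview() rejects it.
try:
    memoryview("A")
    str_has_buffer = True
except TypeError:
    str_has_buffer = False
```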

Copyright

This document has been placed in the public domain.


--
--Guido van Rossum (home page: http://www.python.org/~guido/)


