[Python-Dev] PEP 461: Adding % formatting to bytes and bytearray -- Final, Take 3
Ethan Furman ethan at stoneleaf.us
Tue Mar 25 23:37:11 CET 2014
Okay, I included that last round of comments (from late February).
Barring typos, this should be the final version.
Final comments?
PEP: 461
Title: Adding % formatting to bytes and bytearray
Version: $Revision$
Last-Modified: $Date$
Author: Ethan Furman <ethan at stoneleaf.us>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-01-13
Python-Version: 3.5
Post-History: 2014-01-14, 2014-01-15, 2014-01-17, 2014-02-22, 2014-03-25
Resolution:
Abstract
This PEP proposes adding % formatting operations similar to Python 2's ``str``
type to ``bytes`` and ``bytearray`` [1]_ [2]_.
Rationale
While interpolation is usually thought of as a string operation, there are
cases where interpolation on ``bytes`` or ``bytearray`` objects makes sense,
and the work needed to make up for this missing functionality detracts from
the overall readability of the code.
Motivation
With Python 3 and the split between ``str`` and ``bytes``, one small but
important area of programming became slightly more difficult, and much more
painful -- wire format protocols [3]_.
This area of programming is characterized by a mixture of binary data and
ASCII-compatible segments of text (aka ASCII-encoded text).  Bringing back a
restricted %-interpolation for ``bytes`` and ``bytearray`` will aid both in
writing new wire format code, and in porting Python 2 wire format code.
Common use-cases include ``dbf`` and ``pdf`` file formats, ``email``
formats, and ``FTP`` and ``HTTP`` communications, among many others.
Proposed semantics for ``bytes`` and ``bytearray`` formatting
%-interpolation
All the numeric formatting codes (such as ``%x``, ``%o``, ``%e``, ``%f``,
``%g``, etc.) will be supported, and will work as they do for str, including
the padding, justification and other related modifiers.  The only difference
will be that the results from these codes will be ASCII-encoded text, not
unicode.  In other words, for any numeric formatting code ``%x``::

    b"%x" % val

is equivalent to::

    ("%x" % val).encode("ascii")

Examples::
    >>> b'%4x' % 10
    b'   a'
    >>> b'%#4x' % 10
    b' 0xa'
    >>> b'%04X' % 10
    b'000A'
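
The stated equivalence can be checked directly on an interpreter that implements this PEP (Python 3.5 or later is assumed here). This is only a sketch; the helper name ``same_as_str_formatting`` is illustrative and not part of the proposal.

```python
def same_as_str_formatting(fmt, value):
    """Return True if bytes %-formatting matches the ASCII-encoded str result."""
    bytes_result = fmt.encode("ascii") % value      # b"%x" % val
    str_result = (fmt % value).encode("ascii")      # ("%x" % val).encode("ascii")
    return bytes_result == str_result


# A handful of numeric codes with padding and justification modifiers.
checks = [
    ("%4x", 10),
    ("%#4x", 10),
    ("%04X", 10),
    ("%o", 8),
    ("%.2f", 3.14159),
    ("%e", 1234.5),
    ("%+d", 42),
    ("%-6d|", 7),
]

results = {fmt: same_as_str_formatting(fmt, value) for fmt, value in checks}
print(results)
```

Every entry in ``results`` should be ``True`` if the numeric codes behave as described.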
``%c`` will insert a single byte, either from an ``int`` in range(256), or from
a ``bytes`` argument of length 1, not from a ``str``.
Examples::
    >>> b'%c' % 48
    b'0'
    >>> b'%c' % b'a'
    b'a'
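
As a rough illustration of the accepted and rejected ``%c`` arguments, the sketch below probes a few inputs on Python 3.5 or later. The helper ``try_percent_c`` is hypothetical, and the exact exception type raised for an out-of-range integer is an implementation detail that the text above does not specify, so the helper catches both likely candidates.

```python
def try_percent_c(arg):
    """Return the formatted byte, or the name of the exception that was raised."""
    try:
        return b"%c" % arg
    except (TypeError, OverflowError) as exc:
        return type(exc).__name__


ok_int = try_percent_c(48)        # int in range(256)
ok_bytes = try_percent_c(b"a")    # bytes of length 1
bad_str = try_percent_c("a")      # str is not accepted
bad_range = try_percent_c(256)    # outside range(256)
bad_length = try_percent_c(b"ab") # bytes, but not of length 1

print(ok_int, ok_bytes, bad_str, bad_range, bad_length)
```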
``%s`` is included for two reasons: 1) ``b`` is already a format code for
``format`` numerics (binary), and 2) it will make 2/3 code easier as Python 2.x
code uses ``%s``; however, it is restricted in what it will accept:

- input type supports ``Py_buffer`` [6]_?
  use it to collect the necessary bytes

- input type is something else?
  use its ``__bytes__`` method [7]_ ; if there isn't one, raise a ``TypeError``

In particular, ``%s`` will not accept numbers (use a numeric format code for
that), nor ``str`` (encode it to ``bytes``).
Examples::
    >>> b'%s' % b'abc'
    b'abc'
    >>> b'%s' % 'some string'.encode('utf8')
    b'some string'
    >>> b'%s' % 3.14
    Traceback (most recent call last):
      ...
    TypeError: b'%s' does not accept numbers, use a numeric code instead
    >>> b'%s' % 'hello world!'
    Traceback (most recent call last):
      ...
    TypeError: b'%s' does not accept 'str', it must be encoded to `bytes`
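
The two collection paths for ``%s`` (buffer protocol, then ``__bytes__``) can be sketched as follows, assuming Python 3.5 or later. The ``Header`` class is a hypothetical wire-format field invented for this illustration, and the exact wording of the ``TypeError`` message in a real interpreter may differ from the draft text shown above.

```python
import array


class Header:
    """Hypothetical wire-format field that knows its own byte representation."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __bytes__(self):
        return self.name + b": " + self.value


# Path 1: objects that support the buffer protocol.
from_view = b"%s" % memoryview(b"abc")
from_array = b"%s" % array.array("B", [104, 105])

# Path 2: any other object that defines __bytes__.
line = b"%s\r\n" % Header(b"Host", b"example.org")   # b"Host: example.org\r\n"

# str is rejected; it must be encoded explicitly first.
try:
    b"%s" % "text"
    str_rejected = False
except TypeError:
    str_rejected = True

encoded_ok = b"%s" % "text".encode("ascii")
```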
``%a`` will call ``ascii()`` on the interpolated value.  This is intended
as a debugging aid, rather than something that should be used in production.
Non-ASCII values will be encoded to either ``\xnn`` or ``\unnnn``
representation.  Use cases include developing a new protocol and writing
landmarks into the stream; debugging data going into an existing protocol
to see if the problem is the protocol itself or bad data; a fall-back for a
serialization format; or even a rudimentary serialization format when
defining ``__bytes__`` would not be appropriate [8]_.
.. note::
If a ``str`` is passed into ``%a``, it will be surrounded by quotes.
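
A small sketch of the ``%a`` behavior, assuming Python 3.5 or later: each value is compared against ``ascii(value).encode("ascii")``, which is how the description above reads. The helper name ``debug_bytes`` and the sample values are illustrative only.

```python
def debug_bytes(value):
    """ASCII-only debugging representation, built the long way."""
    return ascii(value).encode("ascii")


samples = ["caf\u00e9", "\u20ac", 3.14, b"raw", None]

# Wrap each value in a 1-tuple so it is always treated as a single argument.
matches = [b"%a" % (value,) == debug_bytes(value) for value in samples]

quoted = b"%a" % ("abc",)        # a str argument keeps its surrounding quotes
escaped = b"%a" % ("caf\u00e9",) # the non-ASCII character becomes an \xnn escape

print(matches, quoted, escaped)
```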
Unsupported codes
``%r`` (which calls ``__repr__`` and returns a ``str``) is not supported.
Proposed variations
It was suggested to let ``%s`` accept numbers, but since numbers have their own
format codes this idea was discarded.

It has been suggested to use ``%b`` for bytes as well as ``%s``.  This was
rejected as not adding any value either in clarity or simplicity.

It has been proposed to automatically use ``.encode('ascii','strict')`` for
``str`` arguments to ``%s``.
- Rejected as this would lead to intermittent failures. Better to have the operation always fail so the trouble-spot can be correctly fixed.
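
To make the "intermittent failures" concern concrete, here is a sketch of the rejected behavior written out by hand. ``implicit_ascii_format`` is a hypothetical helper that mimics what an automatic ``.encode('ascii', 'strict')`` would do; it is not something the PEP provides.

```python
def implicit_ascii_format(fmt, text):
    """Mimic the rejected proposal: silently ASCII-encode a str argument."""
    return fmt % text.encode("ascii", "strict")


# Works during testing, when every user name happens to be pure ASCII ...
ok = implicit_ascii_format(b"USER %s\r\n", "alice")

# ... and fails later, only for the inputs that contain non-ASCII text.
try:
    implicit_ascii_format(b"USER %s\r\n", "ali\u00e7e")
    failed_on_non_ascii = False
except UnicodeEncodeError:
    failed_on_non_ascii = True

print(ok, failed_on_non_ascii)
```

Under the semantics the PEP actually proposes, both calls would have to encode explicitly, so the choice of encoding is visible at the call site.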
It has been proposed to have ``%s`` return the ascii-encoded repr when the
value is a ``str`` (``b'%s' % 'abc'  --> b"'abc'"``).
- Rejected as this would lead to hard to debug failures far from the problem site. Better to have the operation always fail so the trouble-spot can be easily fixed.
Originally this PEP also proposed adding format-style formatting, but it was
decided that format and its related machinery were all strictly text (aka
``str``) based, and it was dropped.
Various new special methods were proposed, such as ``__ascii__``,
``__format_bytes__``, etc.; such methods are not needed at this time, but can
be visited again later if real-world use shows deficiencies with this solution.
Objections
The objections raised against this PEP were mainly variations on two themes:

- the ``bytes`` and ``bytearray`` types are for pure binary data, with no
  assumptions about encodings

- offering %-interpolation that assumes an ASCII encoding will be an
  attractive nuisance and lead us back to the problems of the Python 2
  ``str``/``unicode`` text model

As was seen during the discussion, ``bytes`` and ``bytearray`` are also used
for mixed binary data and ASCII-compatible segments: file formats such as
``dbf`` and ``pdf``, network protocols such as ``ftp`` and ``email``, etc.
``bytes`` and ``bytearray`` already have several methods which assume an
ASCII-compatible encoding: ``upper()``, ``isalpha()``, and ``expandtabs()``,
to name just a few.  %-interpolation, with its very restricted mini-language,
will not be any more of a nuisance than the already existing methods.
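
For reference, the existing methods named above can be seen treating byte values as ASCII text in a short sketch like this one. The sample data is made up for illustration.

```python
data = b"content-type:\ttext/plain"

upper = data.upper()            # only the ASCII letters a-z are changed
expanded = data.expandtabs(4)   # the 0x09 byte is interpreted as an ASCII tab

ascii_alpha = b"abc".isalpha()  # ASCII letters count as alphabetic
high_alpha = b"\xe9".isalpha()  # a lone high byte does not

print(upper, expanded, ascii_alpha, high_alpha)
```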
Some have objected to allowing the full range of numeric formatting codes with the claim that decimal alone would be sufficient. However, at least two formats (dbf and pdf) make use of non-decimal numbers.
Footnotes
.. [1] http://docs.python.org/2/library/stdtypes.html#string-formatting
.. [2] neither string.Template, format, nor str.format are under consideration
.. [3] https://mail.python.org/pipermail/python-dev/2014-January/131518.html
.. [4] to use a str object in a bytes interpolation, encode it first
.. [5] %c is not an exception as neither of its possible arguments are str
.. [6] http://docs.python.org/3/c-api/buffer.html
       examples: ``memoryview``, ``array.array``, ``bytearray``, ``bytes``
.. [7] http://docs.python.org/3/reference/datamodel.html#object.__bytes__
.. [8] https://mail.python.org/pipermail/python-dev/2014-February/132750.html
Copyright
This document has been placed in the public domain.
..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: