[Python-Dev] PEP 461 updates

Neil Schemenauer nas at arctrix.com
Thu Jan 16 17:42:58 CET 2014


Carl Meyer <carl at oddbird.net> wrote:

I think the PEP could really use a rationale section summarizing why these formatting operations are being added to bytes

I agree. My attempt at re-writing the PEP is below.

In order to avoid the problems of auto-conversion and value-generated exceptions, all object checking will be done via isinstance, not by values contained in a Unicode representation. In other words::

  - duck-typing to allow/reject entry into a byte-stream
  - no value generated errors

This seems self-contradictory; "isinstance" is type-checking, which is the opposite of duck-typing.

Again, I agree. We should avoid isinstance checks if possible.

Abstract

This PEP proposes adding %-interpolation to the bytes object.

Rationale

A disruptive but useful change introduced in Python 3.0 was the clean separation of byte strings (i.e. the "bytes" object) from character strings (i.e. the "str" object). The benefit is that character encodings must be explicitly specified and the risk of corrupting character data is reduced.

Unfortunately, this separation has made writing certain types of programs more complicated and verbose. For example, programs that deal with network protocols often manipulate ASCII-encoded strings. Since the "bytes" type does not support string formatting, extra encoding to and decoding from the "str" type is required.
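As a rough illustration of the verbosity described above, the following sketch builds one ASCII protocol line without any bytes formatting. The header and the helper name ``content_length_header`` are made up for this example.

```python
def content_length_header(body: bytes) -> bytes:
    # Without formatting support on bytes, the line is formatted as str
    # and then encoded explicitly.
    line = 'Content-Length: %d\r\n' % len(body)
    return line.encode('ascii')


header = content_length_header(b'hello')
print(header)
```

The round trip through "str" is the extra step the proposal aims to remove.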

For simplicity and convenience it is desirable to introduce formatting methods to "bytes" that allow formatting of ASCII-encoded character data. This change would blur the clean separation of byte strings and character strings. However, it is felt that the practical benefits outweigh the purity costs. The implicit assumption of ASCII encoding would be limited to formatting methods.

One source of many problems with the Python 2 Unicode implementation is the implicit coercion of Unicode character strings into byte strings using the "ascii" codec. If the character string contains only ASCII characters, all is well. However, if the string contains a non-ASCII character, the coercion raises an exception.

The combination of implicit coercion and value-dependent failures has proven to be a recipe for hard-to-debug errors. A program may seem to work correctly when tested (e.g. with string input that happened to be ASCII only) but later fail, often with a traceback far from the source of the real error. The formatting methods for bytes() should avoid this problem by not implicitly encoding data that might fail based on the content of the data.
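The value-dependent failure mode can be reproduced in Python 3 by spelling out the coercion. In this sketch the explicit ``encode('ascii')`` call stands in for the implicit Python 2 coercion, and ``to_wire`` is a hypothetical helper.

```python
def to_wire(text: str) -> bytes:
    # Explicit stand-in for the implicit Python 2 coercion: strict "ascii".
    return text.encode('ascii')


# Works with the input that happened to be used during testing.
ok = to_wire('status=ok')

# Same code path, different value: fails only for non-ASCII content.
try:
    to_wire('caf\u00e9')
    failed = False
except UnicodeEncodeError:
    failed = True

print(ok, failed)
```

Both calls have the same argument type; only the content decides whether an exception is raised.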

Another desirable feature is to allow arbitrary user classes to be used as formatting operands. Generally this is done by introducing a special method that can be implemented by the new class.

Proposed semantics for bytes formatting

Special method __ascii__

A new special method, analogous to __format__, is introduced. This method takes a single argument, a format specifier. The return value is a bytes object. Objects that have an ASCII-only representation can implement this method to allow them to be used as format operands. Objects with natural byte representations should implement __bytes__ or the Py_buffer API.
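A minimal sketch of how such a method might look on a user class. ``__ascii__`` is only proposed here and is not called by any existing interpreter, so the helper ``ascii_format`` (a made-up name) stands in for what bytes formatting would do; the ``Celsius`` class is likewise hypothetical.

```python
class Celsius:
    """Hypothetical user class with an ASCII-only representation."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __ascii__(self, format_spec):
        # Proposed hook: takes a format specifier, returns a bytes object.
        return format(self.degrees, format_spec).encode('ascii')


def ascii_format(value, format_spec=''):
    # Stand-in for the interpreter: look the method up on the type,
    # as is done for other special methods.
    method = getattr(type(value), '__ascii__', None)
    if method is None:
        raise TypeError('%r has no __ascii__ method' % (value,))
    result = method(value, format_spec)
    if not isinstance(result, bytes):
        raise TypeError('__ascii__ must return bytes')
    return result


reading = ascii_format(Celsius(21.5), '.1f')
print(reading)
```

Because str defines no such method, ``ascii_format('text')`` raises TypeError regardless of the string's content.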

%-interpolation

All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.) will be supported, and will work as they do for str, including the padding, justification and other related modifiers. To avoid having to introduce two special methods, the format specifications will be translated to equivalent __format__ specifiers and the __ascii__ method of each argument will be called.

Example:

>>> b'%4x' % 10
b'   a'
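The translation from a %-code to a format specifier could look roughly like the sketch below. It handles only an optional width, an optional precision and a handful of numeric codes; the function names are invented for illustration, and the built-in ``format()`` stands in for a call to the proposed __ascii__ method.

```python
import re

# Deliberately narrow: optional width, optional precision, one numeric code.
_CODE = re.compile(r'%(\d*)(\.\d+)?([xXoeEfFgGd])$')


def percent_to_format_spec(code):
    match = _CODE.match(code)
    if match is None:
        raise ValueError('unsupported format code: %r' % (code,))
    width, precision, conversion = match.groups()
    return width + (precision or '') + conversion


def interpolate_number(code, value):
    # format() plus encode stands in for value.__ascii__(spec).
    spec = percent_to_format_spec(code)
    return format(value, spec).encode('ascii')


print(percent_to_format_spec('%4x'))
print(interpolate_number('%4x', 10))
```

Flags such as ``-`` or ``#`` would need their own mapping, which this sketch leaves out.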

%c will insert a single byte, either from an int in range(256), or from a bytes argument of length 1.

Example:

>>> b'%c' % 48
b'0'

>>> b'%c' % b'a'
b'a'
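The %c rule can be written down as a small function. This is only a sketch of the rule as stated; ``interpolate_c`` is a hypothetical name and the choice of exception types is an assumption.

```python
def interpolate_c(value):
    # An int must fit in one byte.
    if isinstance(value, int):
        if value not in range(256):
            raise OverflowError('%c arg not in range(256)')
        return bytes([value])
    # A bytes argument must be exactly one byte long.
    if isinstance(value, bytes) and len(value) == 1:
        return value
    raise TypeError('%c requires an int in range(256) or a single byte')


print(interpolate_c(48), interpolate_c(b'a'))
```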

%s is restricted in what it will accept:

Examples:

>>> b'%s' % b'abc'
b'abc'

>>> b'%s' % 3.14
b'3.14'

>>> b'%4s' % 12
b'  12'

>>> b'%s' % 'hello world!'
Traceback (most recent call last):
...
TypeError: 'hello world!' has no __ascii__ method, perhaps you need to encode it?

.. note::

   Because the str type does not have an __ascii__ method, attempts to directly use 'a string' as a bytes interpolation value will raise an exception. To use 'string' values, they must be encoded or otherwise transformed into a bytes sequence::

      'a string'.encode('latin-1')
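A short sketch of why the encoding step is left to the caller: the same text yields different bytes under different codecs, so no implicit choice would be right for every program. The variable names are illustrative only.

```python
text = 'caf\u00e9'

# The caller picks the codec; the resulting bytes differ.
as_latin1 = text.encode('latin-1')
as_utf8 = text.encode('utf-8')

# Once encoded, the value is ordinary bytes and can be combined freely.
message = b'name=' + as_latin1

print(as_latin1, as_utf8, message)
```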

Unsupported % format codes ^^^^^^^^^^^^^^^^^^^^^^^^^^

%r (which calls repr) is not supported

format

The format() method will not be implemented at this time but may be added in a later Python release. The __ascii__ method is designed to make adding it later simpler.

Open Questions

Do we need to support the complete set of format codes? For complicated formatting perhaps using the str object to do the formatting and encoding the result is sufficient.
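The alternative raised in this question, formatting with str and encoding the result, needs no new machinery. A minimal sketch, with ``format_then_encode`` as a made-up helper name:

```python
def format_then_encode(template, *args):
    # template is a str, so the complete set of str format codes applies.
    return (template % args).encode('ascii')


line = format_then_encode('%-8s|%08.3f|%+d', 'temp', 3.14159, 7)
print(line)
```

The final ``encode('ascii')`` still fails for non-ASCII results, so the value-dependent error moves to a single explicit call rather than disappearing.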

Should Python check that the bytes returned by __ascii__ are in the range 0-127 (i.e. ASCII)? That seems of little utility since the error would be similar to a unicode-to-str coercion failure in Python 2 and the traceback would normally be far removed from the real error. Built-in types would be designed to never return non-ASCII bytes from the __ascii__ method.
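For reference, the check under discussion is cheap to express. The sketch below shows one way it might look; ``check_ascii`` and the use of ValueError are assumptions, not part of the proposal.

```python
def check_ascii(data: bytes) -> bytes:
    # Iterating over bytes yields ints in range(256).
    for index, byte in enumerate(data):
        if byte > 127:
            raise ValueError(
                'non-ASCII byte 0x%02x at position %d' % (byte, index))
    return data


print(check_ascii(b'abc'))
```

As the paragraph above notes, such an error would depend on the value returned and would surface at formatting time, not where the bad data originated.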

Proposed variations

Instead of introducing a new special method, have numeric types implement __bytes__.

It has been suggested to use %b for bytes instead of %s.

It was suggested to disallow %s from accepting numbers.

It has been proposed to automatically use .encode('ascii','strict') for str arguments to %s.

It has been proposed to have %s return the ascii-encoded repr when the value is a str (b'%s' % 'abc' --> b"'abc'").

Instead of having %-interpolation call __ascii__, introduce a second special method analogous to __str__ and have %s call it.
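The two variations that concern str arguments to %s behave quite differently, which the following sketch tries to make concrete. Both function names are invented, and using ``ascii()`` for the "ascii-encoded repr" is one possible reading of that variation.

```python
def variation_encode_strict(value: str) -> bytes:
    # Variation: implicitly apply .encode('ascii', 'strict').
    return value.encode('ascii', 'strict')


def variation_ascii_repr(value: str) -> bytes:
    # Variation: insert the ASCII-only repr of the string.
    return ascii(value).encode('ascii')


print(variation_encode_strict('abc'))
print(variation_ascii_repr('abc'))
print(variation_ascii_repr('caf\u00e9'))
```

The first variation reintroduces a value-dependent exception for non-ASCII input; the second never fails but puts quote characters and escapes into the output.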

Copyright

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:


