[Python-Dev] PEP 461 Final?

Neil Schemenauer nas at arctrix.com
Fri Jan 17 18:46:15 CET 2014


Ethan Furman <ethan at stoneleaf.us> wrote:

Overriding Principles
=====================

In order to avoid the problems of auto-conversion and Unicode exceptions that could plague Py2 code, all object checking will be done by duck-typing, not by values contained in a Unicode representation [3].

I think a longer "Rationale" section is justified given the amount of discussion this feature generated. Here is a revised version of what I already suggested:

Rationale
=========

A disruptive but useful change introduced in Python 3.0 was the
clean separation of byte strings (i.e. the "bytes" object) from
character strings (i.e. the "str" object).  The benefit is that
character encodings must be explicitly specified and the risk of
corrupting character data is reduced.

Unfortunately, this separation has made writing certain types of
programs more complicated and verbose.  For example, programs
that deal with network protocols often manipulate ASCII encoded
strings or assemble byte strings from fragments.  Since the
"bytes" type does not support string formatting, extra encoding
and decoding to and from the "str" type is often required.
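
To make the verbosity concrete, here is a minimal sketch of building a protocol line without any formatting support on bytes. The helper names `build_header` and `append_header` and the header values are hypothetical, invented only for illustration.

```python
def build_header(name: str, length: int) -> bytes:
    """Assemble an HTTP-style header line as bytes.

    Without formatting on bytes, the text is formatted in the str
    domain and then encoded explicitly.
    """
    line = "%s: %d\r\n" % (name, length)  # format as str
    return line.encode("ascii")           # explicit trip back to bytes


def append_header(payload: bytes, name: bytes, length: int) -> bytes:
    """Append a header to existing bytes by manual concatenation."""
    # The number needs its own str -> bytes conversion step.
    return payload + name + b": " + str(length).encode("ascii") + b"\r\n"
```

Both helpers produce the same bytes; the point is only that each one needs an explicit encode step that a formatting method on bytes would make unnecessary.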

For simplicity and convenience it is desirable to introduce
formatting methods to "bytes" that allow formatting of
ASCII-encoded character data.  This change would blur the clean
separation of byte strings and character strings.  However, it
is felt that the practical benefits outweigh the purity costs.
The implicit assumption of ASCII-encoding would be limited to
formatting methods.

One source of many problems with the Python 2 Unicode
implementation is the implicit coercion of Unicode character
strings into byte strings using the "ascii" codec.  If the
character strings contain only ASCII characters, all is well.
However, if the string contains a non-ASCII character then
coercion causes an exception.

The combination of implicit coercion and value-dependent
failures has proven to be a recipe for hard-to-debug errors.  A
program may seem to work correctly when tested (e.g. string
input that happened to be ASCII only) but later would fail,
often with a traceback far from the source of the real error.
The formatting methods for bytes() should avoid this problem by
not implicitly encoding data that might fail based on the
content of the data.
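
The value-dependent failure can be sketched in Python 3 by emulating the coercion by hand. Python 3 never encodes implicitly, so the explicit `encode("ascii")` call below merely stands in for what Python 2 did behind the scenes; the function name `py2_style_concat` is hypothetical.

```python
def py2_style_concat(prefix: bytes, text: str) -> bytes:
    """Emulate Python 2's implicit coercion via the "ascii" codec."""
    # Assumption: this explicit call models the implicit Python 2 step.
    return prefix + text.encode("ascii")


# ASCII-only input works, so a test suite using it passes.
ok = py2_style_concat(b"user=", "alice")

# The same code fails later, purely because of the *value* passed in.
try:
    py2_style_concat(b"user=", "zo\u00eb")
    failed = False
except UnicodeEncodeError:
    failed = True
```

Nothing about the types changed between the two calls, which is why this class of bug tends to escape testing.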

I think we can back off on the duck-typing idea. It's a good Python principle but I now realize existing %-interpolation doesn't do it. The numeric format codes coerce to long or float.
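
As a quick illustration of that coercion, using the existing str %-interpolation in Python 3 (the `Meters` class is a hypothetical example type):

```python
class Meters:
    """Hypothetical type that is not an int but defines __int__."""

    def __init__(self, value: float) -> None:
        self.value = value

    def __int__(self) -> int:
        return int(self.value)


as_int = "%d" % 3.7               # float is coerced to an integer
as_float = "%f" % 2               # int is coerced to a float
from_obj = "%d" % Meters(12.9)    # coerced through __int__

# The coercion is numeric, not textual: a str is rejected outright.
try:
    "%d" % "3"
    str_rejected = False
except TypeError:
    str_rejected = True
```

Whether a value is accepted depends on its type and the numeric special methods it provides, never on its contents.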

Unsupported codes
-----------------

%r (which calls repr), and %a (which calls ascii() on repr) are not supported.

I think %a should be supported. I imagine it would be quite useful when dumping debugging output to a bytes stream. It's easy to implement and I think the danger of abuse or surprises is small. It would also help when translating Python 2 code: just change %r to %a.
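
A minimal sketch of what that debugging use looks like, assuming an interpreter in which bytes %-interpolation with %a is available (Python 3.5 and later):

```python
value = "caf\u00e9"  # a str containing one non-ASCII character

# %a inserts ascii(value), so the result is always pure ASCII bytes.
line = b"debug: %a" % value

# Equivalent spelling without bytes formatting, for comparison.
manual = ("debug: %s" % ascii(value)).encode("ascii")

# A bytes argument is shown in its repr form, prefix included.
nested = b"%a" % b"abc"
```

Because ascii() escapes anything outside ASCII, this cannot fail depending on the content of the value.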

Proposed variations
===================

It was suggested to let %s accept numbers, but since numbers have their own format codes this idea was discarded.

It has been suggested to use %b for bytes instead of %s.

  - Rejected as %b does not exist in Python 2.x %-interpolation, which is why we are using %s.

I think we should use %b instead of %s. In that case, I'm fine with %b not accepting numbers. Using %b clearly indicates we are inserting arbitrary bytes. Changing %s to %b also provides a useful code review step when translating from Python 2.x.
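
For illustration, this is how %b behaves in an interpreter that implements it (Python 3.5 and later). The helper `rejects` is a hypothetical name used only in this sketch.

```python
payload = b"\x00\xff"

# %b inserts the bytes unchanged, whatever their values.
frame = b"DATA %b END" % payload


def rejects(value) -> bool:
    """Return True if %b refuses to interpolate the given value."""
    try:
        b"%b" % value
    except TypeError:
        return True
    return False


text_rejected = rejects("text")   # str is not encoded implicitly
number_rejected = rejects(5)      # numbers need a numeric code such as %d
```

The refusal is based on the type of the argument, so it shows up the first time the line runs rather than only for some inputs.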

To ease porting from Python 2.x code, I propose adding a command-line option that enables %s and %r format codes for bytes %-interpolation. I'm going to write a draft PEP (it would depend on PEP 461 being implemented).

Originally this PEP also proposed adding format-style formatting, but it was decided that format and its related machinery were all strictly text (aka str) based, and it was dropped.

I would also argue that we should limit the scope of this PEP. It has already generated a massive amount of discussion. Nothing precludes us from adding support for format() to bytes in the future, once we decide we want it and agree on how it should work.

Various new special methods were proposed, such as __ascii__, __format_bytes__, etc.; such methods are not needed at this time, but can be visited again later if real-world use shows deficiencies with this solution.

I agree, new special methods are not needed at this time since numeric codes do use duck-typing and __bytes__ already exists.
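
Assuming the existing hook meant here is the `__bytes__` special method, a small sketch of how a user-defined type can take part in bytes interpolation without any new protocol (requires Python 3.5 or later; the `Token` class is hypothetical):

```python
class Token:
    """Hypothetical wrapper that knows its own bytes representation."""

    def __init__(self, raw: bytes) -> None:
        self.raw = raw

    def __bytes__(self) -> bytes:
        return self.raw


# %b falls back to __bytes__ for objects that are not bytes-like.
command = b"AUTH %b\r\n" % Token(b"s3cr3t")
```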

Neil


