[Python-Dev] PEP 461 Final?

Fri Jan 17 18:46:15 CET 2014

Rationale
=========

A disruptive but useful change introduced in Python 3.0 was the
clean separation of byte strings (i.e. the "bytes" object) from
character strings (i.e. the "str" object).  The benefit is that
character encodings must be explicitly specified and the risk of
corrupting character data is reduced.

Unfortunately, this separation has made writing certain types of
programs more complicated and verbose.  For example, programs
that deal with network protocols often manipulate ASCII encoded
strings or assemble byte strings from fragments.  Since the
"bytes" type does not support string formatting, extra encoding
to and decoding from the "str" type is often required.
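
The round trip through "str" described above can be sketched as
follows.  This is only an illustration: the helper names
build_request_line and chunk_size_line are hypothetical, and the
HTTP-like message is just a convenient example of an ASCII-based
protocol.

```python
def build_request_line(method: str, path: str, content_length: int) -> bytes:
    # Without formatting support on bytes, the header is built as a
    # str and then explicitly encoded to ASCII.
    header = "{} {} HTTP/1.1\r\nContent-Length: {}\r\n\r\n".format(
        method, path, content_length)
    return header.encode("ascii")


def chunk_size_line(size: int) -> bytes:
    # Even a single integer has to pass through str before it can
    # be appended to a byte string.
    return ("%x" % size).encode("ascii") + b"\r\n"


request = build_request_line("GET", "/index.html", 0)
```

Every fragment that is not already a byte string needs its own
encode step, which is the verbosity this section refers to.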

For simplicity and convenience it is desirable to introduce
formatting methods to "bytes" that allow formatting of
ASCII-encoded character data.  This change would blur the clean
separation of byte strings and character strings.  However, it
is felt that the practical benefits outweigh the purity costs.
The implicit assumption of ASCII-encoding would be limited to
formatting methods.
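
As a rough sketch of the kind of formatting being proposed, the
snippet below uses the "%" operator on a bytes object.  It assumes an
interpreter that implements this proposal (Python 3.5 and later);
on earlier 3.x releases the same lines raise TypeError.  The helper
names are hypothetical.

```python
def content_length_header(length: int) -> bytes:
    # Numeric codes such as %d produce ASCII digits directly in the
    # resulting byte string.
    return b"Content-Length: %d\r\n" % length


def str_is_rejected() -> bool:
    # A str argument is not implicitly encoded for %s; the operation
    # fails regardless of the string's content.
    try:
        b"%s" % "text"
    except TypeError:
        return True
    return False


header = content_length_header(42)
```

Only the format operation itself assumes ASCII here; the bytes
object remains a sequence of bytes in every other respect.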

One source of many problems with the Python 2 Unicode
implementation is the implicit coercion of Unicode character
strings into byte strings using the "ascii" codec.  If the
character string contains only ASCII characters, all is well.
However, if the string contains a non-ASCII character, the
coercion raises an exception.

The combination of implicit coercion and value-dependent
failures has proven to be a recipe for hard-to-debug errors.  A
program may seem to work correctly when tested (e.g. with string
input that happened to be ASCII-only) but fail later, often with
a traceback far from the source of the real error.
The formatting methods for bytes() should avoid this problem by
never implicitly encoding data, since such an encoding could
succeed or fail depending on the content of the data.
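
The value-dependent failure described above can be imitated in
Python 3 with an explicit encode call.  This is only an emulation of
the Python 2 behaviour, not the actual Python 2 code path, and
py2_style_concat is a hypothetical helper.

```python
def py2_style_concat(prefix: bytes, text: str) -> bytes:
    # Emulates Python 2's implicit coercion: the text is silently
    # encoded with the "ascii" codec before concatenation.
    return prefix + text.encode("ascii")


# Works during testing, because the input happens to be ASCII-only.
ok = py2_style_concat(b"Name: ", "Alice")

# Fails later on real data containing a non-ASCII character.
try:
    py2_style_concat(b"Name: ", "Zo\u00eb")
    failed = False
except UnicodeEncodeError:
    failed = True
```

The same call succeeds or fails depending only on the content of
its argument, which is the behaviour the bytes formatting methods
are meant to avoid.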