[Python-Dev] PEP 460 reboot

Guido van Rossum guido at python.org
Mon Jan 13 00:55:23 CET 2014

There's a lot of discussion about PEP 460 and I haven't read it all. Maybe you all have already reached the same conclusion that I have. In that case I apologize (but the PEP should be updated). Here's my contribution:

PEP 460 itself currently rejects support for %d, AFAIK on the basis that bytes aren't necessarily ASCII. I think that's a misunderstanding of the intention of the bytes type.

The key reason for introducing a separate bytes type in Python 3 is to avoid mixing bytes and text. This aims to avoid the classic Python 2 Unicode failure, where str+unicode fails or succeeds based on whether str contains non-ASCII characters or not, which means it is easy to miss in testing. Properly written code in Python 3 will fail based on the type of the objects, not based on their contents. Content-based failures are still possible, but they occur in typical "boundary" operations such as encode/decode.
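A minimal sketch of that distinction in Python 3 (the helper name try_concat is made up for illustration): mixing bytes and text fails on the types alone, whatever the bytes contain, while the content-based failure only shows up at the decode boundary.

```python
def try_concat(data, text):
    """Return data + text, or the name of the exception it raises."""
    try:
        return data + text
    except TypeError as exc:
        return type(exc).__name__


# The failure depends on the types, not on whether the bytes are ASCII.
ascii_result = try_concat(b"abc", "def")
non_ascii_result = try_concat(b"\xff\xfe", "def")

# Content-based failures still exist, but only at the boundary.
decoded_ok = b"abc".decode("ascii")
try:
    b"\xff\xfe".decode("ascii")
    boundary_error = None
except UnicodeDecodeError as exc:
    boundary_error = type(exc).__name__
```

Both concatenations fail the same way, so a test suite that only ever feeds ASCII data still catches the bug.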

But this does not mean the bytes type isn't allowed to have a noticeable bias in favor of encodings that are ASCII supersets, even if not all bytes objects contain such data (e.g. image data, compressed data, binary network packets, and so on).

IMO it's totally fine and consistent if b'%d' % 42 returns b'42' and also for b'{}'.format(42) to return b'42'. There are numerous places where bytes are already assumed to use an ASCII superset:

- byte literals: b'abc' (it's a syntax error to have a non-ASCII character here)
- the upper() and lower() methods modify the ASCII letter positions
- int(b'42') == 42, float(b'3.14') == 3.14
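The existing ASCII assumptions listed above can be checked directly. This is only a small illustration; the variable names are arbitrary.

```python
# upper() and lower() only touch the ASCII letter positions;
# a byte such as 0xE9 (Latin-1 "e acute") is left alone.
upper_ascii = b"abc".upper()
upper_mixed = b"caf\xe9".upper()

# int() and float() parse ASCII-encoded digits straight from bytes.
parsed_int = int(b"42")
parsed_float = float(b"3.14")
```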

I looked through the example code I recently wrote for asyncio (which uses bytes for all data read or written). There are several places where I have to make a clumsy detour via text strings because I need to include an ASCII-encoded decimal integer (e.g. the Content-Length header) or a hex-encoded one (e.g. for Transfer-Encoding: chunked). Those detours aren't needed for parsing because int() accepts bytes just fine.
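A rough sketch of the kind of detour meant here. The function names are hypothetical and this is not the actual asyncio example code; it only shows the asymmetry between writing and parsing.

```python
def content_length_header(body: bytes) -> bytes:
    # Writing: the integer has to go through str before it can join bytes.
    return b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"


def encode_chunk(data: bytes) -> bytes:
    # Chunked transfer encoding wants the size in hex, again via a str.
    size = format(len(data), "x").encode("ascii")
    return size + b"\r\n" + data + b"\r\n"


def parse_content_length(line: bytes) -> int:
    # Parsing: no detour needed, int() accepts bytes directly.
    _name, _sep, value = line.partition(b":")
    return int(value.strip())


header = content_length_header(b"hello")
chunk = encode_chunk(b"x" * 26)
length = parse_content_length(header)
```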

I also note that the behavior of the re module is perfect: if the pattern is bytes, it can only match bytes and the extracted data is bytes, and ditto for text -- so it supports both types but doesn't allow mixing them. The urllib module does this too -- at considerable cost in its implementation, but it's the right thing, because there really are good cases to be made for treating URLs as text as well as for treating them as bytes (as with filenames, command line arguments, and environment variables).
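For illustration, a short sketch of that behavior using re and urllib.parse (the sample strings and the URL are made up):

```python
import re
from urllib.parse import urlsplit

# A bytes pattern matches bytes and yields bytes.
bytes_group = re.match(rb"(\d+)", b"123 bytes").group(1)

# A text pattern matches text and yields text.
text_group = re.match(r"(\d+)", "123 text").group(1)

# Mixing the two is rejected on type, independent of content.
try:
    re.match(rb"\d+", "123")
    mixing_error = None
except TypeError as exc:
    mixing_error = type(exc).__name__

# urllib.parse follows the same rule: bytes in, bytes out; text in, text out.
bytes_netloc = urlsplit(b"http://example.com/path").netloc
text_netloc = urlsplit("http://example.com/path").netloc
```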

I'm sad that the json module in Python 3 doesn't support bytes at all, but at least it is consistent -- it always produces text, restricted to ASCII characters by default. The same applies to the http module, which IIUC adheres to the standard by treating headers as Latin-1.
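A small sketch of the json behavior described here, limited to json.dumps (the sample data is made up):

```python
import json

# dumps() always returns text; by default non-ASCII characters are escaped,
# so the result can be encoded as ASCII without loss.
encoded = json.dumps({"name": "caf\u00e9"})
as_ascii_bytes = encoded.encode("ascii")

# bytes values are not accepted as JSON data.
try:
    json.dumps(b"abc")
    bytes_error = None
except TypeError as exc:
    bytes_error = type(exc).__name__
```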

--Guido van Rossum (python.org/~guido)
