[Python-Dev] PEP 460 reboot

Nick Coghlan ncoghlan at gmail.com
Mon Jan 13 07:51:17 CET 2014


On 13 January 2014 09:55, Guido van Rossum <guido at python.org> wrote:

> There's a lot of discussion about PEP 460 and I haven't read it all. Maybe you all have already reached the same conclusion that I have. In that case I apologize (but the PEP should be updated). Here's my contribution:
>
> PEP 460 itself currently rejects support for %d, AFAIK on the basis that bytes aren't necessarily ASCII. I think that's a misunderstanding of the intention of the bytes type. The key reason for introducing a separate bytes type in Python 3 is to avoid mixing bytes and text. This aims to avoid the classic Python 2 Unicode failure, where str+unicode fails or succeeds based on whether str contains non-ASCII characters or not, which means it is easy to miss in testing. Properly written code in Python 3 will fail based on the type of the objects, not based on their contents. Content-based failures are still possible, but they occur in typical "boundary" operations such as encode/decode. But this does not mean the bytes type isn't allowed to have a noticeable bias in favor of encodings that are ASCII supersets, even if not all bytes objects contain such data (e.g. image data, compressed data, binary network packets, and so on).

I am a strong -1 on the more lenient proposal, as it makes binary interpolation in Python 3 an unsafe operation for ASCII incompatible binary formats.

The existing binary operations that assume ASCII do so inherently - they're not input driven, the operation itself assumes ASCII, so if you're working with data that may not be ASCII compatible, you simply don't use them (these are operations like title(), upper(), lower(), the default arguments for split() and strip(), etc). They don't accept text or other structured data as input - you have to provide existing binary data or individual byte values (or, in the case of split() and strip(), the special value None to indicate the assumption of ASCII whitespace).
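As a rough illustration of that distinction, using only standard bytes methods (the sample values are made up for this example):

```python
# The ASCII assumption lives in the method itself, not in its arguments.
ascii_data = b"content-type"
shouted = ascii_data.upper()

# The same call silently corrupts data in an ASCII-incompatible encoding:
# in UTF-16-LE, U+0161 is the byte pair 61 01, and upper() rewrites the 0x61.
utf16_data = "\u0161".encode("utf-16-le")
corrupted = utf16_data.upper().decode("utf-16-le")

# Arguments must already be binary; text is rejected by type, not by content.
try:
    b"a b".split(" ")
    split_accepts_text = True
except TypeError:
    split_accepts_text = False

# None (the default) is the one special value: it means "ASCII whitespace".
fields = b"a b\tc".split()
```

Nothing in the arguments can introduce the ASCII assumption here; choosing to call upper() or a default split() is what introduces it.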

With PEP 460 as it stands, binary interpolation is safe - you can't implicitly introduce an ASCII assumption, regardless of the format string or input data, as everything that hasn't already been translated to the binary domain will be rejected with a TypeError. By allowing format characters that do assume ASCII, the entire construct is rendered unsafe - you have to look inside the format string to determine if it is assuming ASCII compatibility or not, thus the entire construct must be deemed as assuming ASCII compatibility at the level of static semantic analysis.

The more lenient proposal also creates an ambiguity about what it means to pass an integer to a binary formatting operation - is it about inserting individual byte values in the range 0-255, or is it about inserting the ASCII encoded digits of arbitrary integers, or does it depend on which formatting code you use? PEP 460 is currently entirely consistent with the other binary operations (it only accepts integers in the 0-255 range and interprets them as byte values), while the more lenient approach goes for the "it depends on the formatting code" alternative.
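A minimal sketch of the strict behaviour described above, written as a plain function rather than as the % operator. The name strict_interpolate and its single %s placeholder are hypothetical, chosen for illustration; this is not the PEP's specification or its implementation.

```python
def strict_interpolate(template: bytes, *args) -> bytes:
    """Hypothetical helper: fill each b"%s" slot with binary data only."""
    pieces = template.split(b"%s")
    if len(pieces) - 1 != len(args):
        raise ValueError("placeholder count does not match argument count")
    out = bytearray(pieces[0])
    for arg, tail in zip(args, pieces[1:]):
        if isinstance(arg, int):
            out.append(arg)        # a byte value; ValueError outside 0-255
        elif isinstance(arg, (bytes, bytearray, memoryview)):
            out += bytes(arg)      # already in the binary domain
        else:
            raise TypeError(f"expected binary data, got {type(arg).__name__}")
        out += tail
    return bytes(out)


packet = strict_interpolate(b"\x02%s%s\x03", b"ID", 65)   # 65 is the byte 0x41

# The two readings of "pass an integer" produce different bytes:
as_byte_value = bytes([65])                  # one byte, value 65
as_ascii_digits = str(65).encode("ascii")    # two bytes, the digits "6" and "5"
```

In this sketch a str argument fails with TypeError whatever it contains, so the caller never has to inspect the template to know whether an ASCII assumption has crept in.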

Allowing these ASCII assuming format codes in the core bytes interpolation introduces exactly the same problem as is present in the Python 2 text model: code that appears to support arbitrary binary data, but is in fact assuming ASCII compatibility. So any code that has to handle ASCII incompatible encodings will need to be implemented with the warning "don't use any of the binary formatting operations for data that may not be ASCII compatible, but we also don't provide a convenient equivalent that can be guaranteed to be safe so we know you're going to ignore this warning and do it anyway". That kind of "don't do that, it may cause problems with certain inputs" is exactly the kind of bug magnet that the Python 3 transition was designed to categorically eliminate.
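One illustration of that failure mode, using a made-up length field: ASCII digits are spliced into a UTF-16-LE stream, nothing raises, and the result is simply wrong.

```python
prefix = "len=".encode("utf-16-le")      # ASCII-incompatible binary data
digits = str(42).encode("ascii")         # what an ASCII-assuming insertion would add

message = prefix + digits                # succeeds: both operands are bytes
decoded = message.decode("utf-16-le")    # no exception, but not "len=42"

# Encoding the digits to match the stream gives the intended text.
correct = (prefix + str(42).encode("utf-16-le")).decode("utf-16-le")
```

Here the two ASCII bytes 0x34 0x32 are read back as the single code unit U+3234, so the error only shows up when someone looks at the output.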

PEP 460 is perfect in that regard - it provides exactly as much functionality as can be done correctly when manipulating arbitrary binary data, and no more. It has no trace of the legacy Python 2 text model.

However, I also accept that the Python 2 text model is convenient for certain use cases. This is why, in addition to PEP 460 as it currently stands, I am also (with Benno Rice) one of the instigators of the asciicompat project, and have promised Benno that I will ensure that any interoperability bugs asciicompat.asciistr uncovers in the core types are fixed (for Python 3.3+, since it depends on the PEP 393 internal representation for strings). asciistr will provide a public API that behaves exactly like a text type (including interoperating with strings and returning length 1 substrings when indexing, interpreting integers and other numeric types as their ASCII representation when passed in, supporting full text formatting semantics), but also exists in the binary domain, by exporting the bytes view of its internal data through the PEP 3118 buffer API.

In this way, asciistr will be a new general purpose mechanism for translating between the binary and text domains in Python 3, just like str.encode, bytes/bytearray/memoryview.decode and the struct module. It doesn't need to compromise - its objectives are to make working with ASCII compatible binary protocols and writing hybrid binary/text APIs exactly as convenient as it was in Python 2, because that's where the test suite is developed: in Python 2, using "asciistr=str". It just doesn't need to be a builtin and, at this point in time, doesn't even need to be in the standard library. It can be developed on GitHub and published on PyPI and made available for Python 3.3 and above (it's also trivially 2.x compatible: there, it just republishes the str builtin as asciicompat.asciistr).
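For reference, the existing translation mechanisms look like this in practice (the sample values are illustrative only):

```python
import struct

request_line = "GET / HTTP/1.1\r\n".encode("ascii")   # text -> binary
status_text = b"200 OK".decode("ascii")                # binary -> text

length_field = struct.pack(">H", 258)                  # integer -> two big-endian bytes
(length,) = struct.unpack(">H", length_field)          # and back again
```

Each of these is an explicit boundary crossing: the caller names the encoding or the struct format, so the ASCII or binary-layout assumption is visible at the call site.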

Once asciistr is working, we can also look into creating "asciicompat.asciiview", which would be a PEP 3118 consumer in addition to a publisher, and provide asciistr functionality for existing binary data, without needing to copy it.

ASCII compatible protocols are special and are worthy of having a dedicated type devoted to handling them. However, it shouldn't be at the expense of compromising the ability of Python 3 users to ensure that they aren't accidentally introducing assumptions of ASCII compatibility where they don't belong, particularly when doing so produces a clearly inferior solution. The superior solution looks like this:

- bytes/bytearray/memoryview: pure binary types, operate entirely in the binary domain. They provide convenience operations that are only valid for ASCII compatible data, but the ASCII assumption is inherent in the operation itself rather than being input driven (the one minor exception being that passing None to split() and strip() operations assumes ASCII whitespace).
- asciicompat.asciistr: hybrid type that exposes a text API in the application domain, but also exposes binary data directly for binary interoperability.
- str: pure text type, operates entirely in the application domain.

This approach also opens up the possibility of eventually leveraging PEP 393 to provide an asciicompat.utf8str type which allows arbitrary unicode characters and exports the UTF-8 representation, rather than restricting the permitted code points to 7-bit ASCII, as well as an asciicompat.latin1str which permits arbitrary 8 bit data (representing it as latin-1 text in the application domain), or even an asciicompat.encodedstr that supports any 8-bit encoding.
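The utf8str, latin1str and encodedstr types are only possibilities at this stage, but the codec behaviour they would build on can be shown with the standard codecs alone. A small sketch:

```python
every_byte = bytes(range(256))

# latin-1 maps each byte to the code point with the same value, so any
# 8-bit data survives a round trip through the text domain.
as_text = every_byte.decode("latin-1")
round_tripped = as_text.encode("latin-1")

# An ASCII-only view is stricter: bytes above 0x7F are rejected.
try:
    every_byte.decode("ascii")
    ascii_accepts_all = True
except UnicodeDecodeError:
    ascii_accepts_all = False

# UTF-8 allows arbitrary code points, at more than one byte per character.
utf8_bytes = "\u00e9".encode("utf-8")
```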

The key thing that the text model change in Python 3 enabled is for us to use the type system to help with managing the complexity of dealing with text encodings. We've got a long way with just the two pure types, and no additional types that straddle the binary/text boundary the way the Python 2 str type did. Unlike introducing new ASCII-only operations to the bytes type, adding new types specifically for dealing with ASCII compatible formats (especially starting life as a third party library) isn't compromising the Python 3 text model, it's embracing it and making it work for us (which is why I've been suggesting that it be considered since at least 2010). The problem with "str" in Python 2 was that one type was used to represent too many things with serious semantic differences.

The ongoing attempts to reintroduce that ambiguity to the core bytes type, rather than exploring the creation of new types and then filing bugs for any interoperability issues those attempts uncover in the core types, represent one of the worst cases of paradigm lock that I have ever seen :P

Regards, Nick.

-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
