[Python-Dev] Python 3.x and bytes

Stephen J. Turnbull stephen at xemacs.org
Thu May 19 10:00:24 CEST 2011


Robert Collins writes:

Thats separate to the implementation issues I have mentioned in this thread and previous.

Oops, sorry.

Nevertheless, I personally think that b'a'[0] == 97 is a good idea, and consistent with everything else in Python. It's Unicode (str) that is weird. That weirdness is surprising when first encountered by a C or Lisp programmer, but not enough to cause a heart attack given how weird natural language is. But I don't see why that weirdness (an element of LIST of TYPE is a LIST of TYPE, hey, young man, you're very smart but it's turtles all the way down!) should be replicated elsewhere.
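To make the contrast concrete, here is a minimal sketch of how indexing behaves for the two types in Python 3. The variable names are only illustrative.

```python
data = b'a'   # a bytes object of length one
text = 'a'    # a str of length one

byte_elem = data[0]     # indexing bytes yields an int
char_elem = text[0]     # indexing str yields another str of length one
byte_slice = data[0:1]  # slicing bytes still yields bytes

print(byte_elem, type(byte_elem).__name__)
print(repr(char_elem), type(char_elem).__name__)

# The "turtles all the way down" property of str: an element of a str
# is itself a str, so it can be indexed again without end.
turtles = text[0][0][0]
```

Slicing (data[0:1]) is the way to get a length-one bytes object back when that is what is wanted.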

If you want your bytes object to behave like a str, it's very easy to get that (.decode('latin1')), and nobody has yet demonstrated that this is too time-inefficient for real work, given the other overhead imposed by Python. The space inefficiency could be dealt with as Greg points out (by internally having a Unicode representation using 1 byte instead of 2 or 4). But if you want your bytes object to be a string, then you're confused. It isn't (any more). Even if it's just a matter of flipping one bit in the type field, a str-with-unibyte-representation is not equal to a bytes object with the same bytes.
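A small sketch of the .decode('latin1') route, assuming nothing beyond the standard codecs: Latin-1 maps each byte value 0..255 to the code point with the same number, so the conversion is lossless in both directions, yet the resulting str still does not compare equal to the original bytes.

```python
raw = bytes(range(256))          # every possible byte value
as_text = raw.decode('latin1')   # one character per byte, same numeric value

# Each character's code point equals the byte it came from.
same_numbers = [ord(c) for c in as_text] == list(raw)

# Encoding back recovers the original bytes exactly.
round_trip = as_text.encode('latin1')

# A str never compares equal to a bytes object, whatever its contents.
# (Running Python with -b turns this comparison into a BytesWarning.)
equal_across_types = (as_text == raw)

print(same_numbers, round_trip == raw, equal_across_types)
```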

For example, you write:

urlparse converting bytes to 'str' to operate on them is at best a kludge - you're forcing 5 times the storage (the original bytes + 4 bytes-per-byte when its decoded into unicode) to work on something which is defined as a BNF *that uses ascii*.

Indeed it (RFC 3986) does use ASCII. But I think there is confusion in your words. This is what the RFC says about that use of ASCII:

  2. Characters

The URI syntax provides a method of encoding data, presumably for the sake of identifying a resource, as a sequence of characters. [...]

The ABNF notation defines its terminal values to be non-negative integers (codepoints) based on the US-ASCII coded character set [ASCII]. Because a URI is a sequence of characters, we must invert that relation in order to understand the URI syntax. Therefore, the integer values used by the ABNF must be mapped back to their corresponding characters via US-ASCII in order to complete the syntax rules.

Ie, ASCII is irrelevant to (the modern definition of) URLs except as it is a convenient and familiar way to refer to a certain familiar and rather small set of characters. There are reasons for this (that I'm not going to rehash here), and they are the same reasons why Python 3's behavior is "correct" IMHO (modulo the issue about the type of a list element, which I discuss above).
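As a rough illustration of the mapping the RFC describes, the sketch below takes a list of integers and maps them back to characters via US-ASCII. The list GEN_DELIMS_CODEPOINTS is my own illustrative transcription of the gen-delims characters as integers, not a quotation of the grammar, which writes them as quoted literals.

```python
# Hypothetical integer "terminal values" for the gen-delims characters,
# written as code points in the way ABNF defines its terminals.
GEN_DELIMS_CODEPOINTS = [0x3A, 0x2F, 0x3F, 0x23, 0x5B, 0x5D, 0x40]

# Map each integer back to its character via US-ASCII.
gen_delims = ''.join(bytes([n]).decode('ascii') for n in GEN_DELIMS_CODEPOINTS)
print(gen_delims)

# The characters are what the syntax is about; the integers are only a
# way of naming them. Going the other way is just as mechanical.
back_to_ints = list(gen_delims.encode('ascii'))
```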

It is true that one might like there to be a literal that expresses `ord(bytes-object-of-length-one)', ie, something like o'a' == 97. (This is different from Greg's x'6465616462656566' == b'deadbeef', which I don't think helps solve the confusion problem although it would definitely be convenient.)
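Neither o'a' nor x'...' exists as literal syntax; they are only proposals in this thread. For comparison, here is a short sketch of the spellings that do work today: ord() on a length-one bytes object, plain indexing, and bytes.fromhex() for the hex case.

```python
# The value the hypothetical o'a' literal would denote.
via_ord = ord(b'a')    # ord() accepts a bytes object of length one
via_index = b'a'[0]    # indexing gives the same int

# The value the hypothetical x'6465616462656566' literal would denote.
from_hex = bytes.fromhex('6465616462656566')

print(via_ord, via_index, from_hex)
```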


