[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

Steven D'Aprano steve at pearwood.info
Sat Jan 11 16:38:39 CET 2014


On Sat, Jan 11, 2014 at 01:56:56PM +0100, Juraj Sukop wrote:

On Sat, Jan 11, 2014 at 6:36 AM, Steven D'Aprano <steve at pearwood.info> wrote:

> If you consider PDF as binary with occasional pieces of ASCII text, then working with bytes makes sense. But I wonder whether it might be better to consider PDF as mostly text with some binary bytes. Even though the bulk of the PDF will be binary, the interesting bits are text. E.g. your example:

10 0 obj
  << /Type /XObject
     /Width 100
     /Height 100
     /Alternates 15 0 R
     /Length 2167
  >>
stream
...binary image data...
endstream
endobj

> Even though the binary image data is probably much, much larger in length than the text shown above, it's (probably) trivial to deal with: convert your image data into bytes, decode those bytes into Latin-1, then concatenate the Latin-1 string into the text above.

This is similar to what Chris Barker suggested. I'm also not trying to be difficult here, but please explain one thing to me. Treating bytes as if they were Latin-1 is a bad idea,

Correct. Bytes are not Latin-1. Here are some bytes which represent a word I extracted from a text file on my computer:

b'\x8a\x75\xa7\x65\x72\x73\x74'

If you imagine that they are Latin-1, you might think that the word is a C1 control character ("VTS", or Vertical Tabulation Set) followed by "u§erst", but it is not. It is actually the German word "äußerst" ("extremely"), and the text file was generated on a 1990s vintage Macintosh using the MacRoman "extended ASCII" code page.
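A minimal sketch of that mismatch, assuming Python's built-in `mac_roman` codec is the right match for the code page described above. Decoding as Latin-1 never raises; it just produces the wrong characters.

```python
raw = b'\x8a\x75\xa7\x65\x72\x73\x74'

# Wrong guess: Latin-1 accepts every byte, so there is no error,
# only a C1 control character followed by "u", U+00A7 and "erst".
as_latin1 = raw.decode('latin-1')

# Right guess: the bytes were written using the MacRoman code page.
as_macroman = raw.decode('mac_roman')

print(ascii(as_latin1))                      # '\x8au\xa7erst'
print(as_macroman == '\u00e4u\u00dferst')    # True, i.e. "äußerst"
```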

that's why "%f" got dropped in the first place, right? How is it then alright to put an image inside an Unicode string?

The point that I am making is that many people want to add formatting operations to bytes so they can put ASCII strings inside bytes. But (as far as I can tell) they don't need to do this, because they can treat Unicode strings containing code points U+0000 through U+00FF (i.e. the same range as handled by Latin-1) as if they were bytes. This gives you:

No need to wait for Python 3.5 to come out, you can do this right now.

Of course, this is a little bit "unclean": it breaks the separation of text and bytes by treating bytes as if they were Unicode code points, which they are not. But I believe that this is a practical technique which is not too hard to deal with. For instance, suppose I have a mixed format which consists of an ASCII tag, a number written in ASCII, a NUL separator, and some binary data:

# Using bytes
import struct

values = [29460, 29145, 31098, 27123]
blob = b"".join(struct.pack(">h", n) for n in values)
data = b"Tag:" + str(len(values)).encode('ascii') + b"\0" + blob

# => gives data = b'Tag:4\x00s\x14q\xd9yzi\xf3'

That's a bit ugly, but not too ugly. I could write code like that. But if bytes had % formatting, I might write this instead:

data = b"Tag:%d\0%s" % (len(values), blob)

This is a small improvement, but I can't use it until Python 3.5 comes out. Or I could do this right now:

# Using text
import struct

values = [29460, 29145, 31098, 27123]
blob = b"".join(struct.pack(">h", n) for n in values)
data = "Tag:%d\0%s" % (len(values), blob.decode('latin-1'))

# => gives data = 'Tag:4\x00s\x14qÙyzió'

When I'm ready to transmit this over the wire, or write to disk, then I encode, and get:

data.encode('latin-1')
# => b'Tag:4\x00s\x14q\xd9yzi\xf3'

which is exactly the same as I got in the first place. In this case, I'm not using Latin-1 for its mapping of bytes to characters (e.g. byte \xf3 = char ó), but for the useful property that all 256 distinct bytes are valid in Latin-1. Any other encoding with the same property will do.
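The round-trip property can be checked directly. This is only a sketch: `round_trips` is a hypothetical helper name, not something from the standard library.

```python
def round_trips(encoding):
    """Return True if every byte 0..255 survives decode -> encode unchanged."""
    all_bytes = bytes(range(256))
    try:
        return all_bytes.decode(encoding).encode(encoding) == all_bytes
    except UnicodeError:
        # At least one byte has no character in this encoding.
        return False

# With Latin-1, byte N simply becomes code point N.
latin1_text = bytes(range(256)).decode('latin-1')
code_points = [ord(ch) for ch in latin1_text]

print(round_trips('latin-1'))            # True
print(round_trips('ascii'))              # False
print(round_trips('utf-8'))              # False
print(code_points == list(range(256)))   # True
```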

It is a little unfortunate that struct gives bytes rather than a str, but you can hide that with a simple helper function:

def b2s(raw):
    # Named "raw" rather than "bytes" to avoid shadowing the built-in type.
    return raw.decode('latin1')

data = "Tag:%d\0%s" % (len(values), b2s(blob))
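Putting the pieces together, here is a hedged end-to-end sketch. `s2b` is a hypothetical inverse of `b2s` introduced only for illustration; the idea is that it would be called once, at the output boundary.

```python
import struct

def b2s(raw):
    """bytes -> str, mapping byte N to code point U+00NN."""
    return raw.decode('latin-1')

def s2b(text):
    """str -> bytes; hypothetical inverse of b2s for the output boundary."""
    return text.encode('latin-1')

values = [29460, 29145, 31098, 27123]
blob = b"".join(struct.pack(">h", n) for n in values)

# Build the record as text, then encode once when writing it out.
via_text = s2b("Tag:%d\0%s" % (len(values), b2s(blob)))

# Build the same record directly from bytes for comparison.
via_bytes = b"Tag:" + str(len(values)).encode('ascii') + b"\0" + blob

print(via_text == via_bytes)  # True
```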

Also, apart from the in/out conversions, do any other difficulties come to your mind?

No. If you accidentally introduce a non-Latin-1 code point (anything above U+00FF), you'll get an exception when you encode.
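A small sketch of that failure mode, using U+20AC (the euro sign) as an arbitrary example of a code point outside the Latin-1 range. The error surfaces at the encode step, when the text is turned back into bytes.

```python
# "\u20ac" stands in for any text that accidentally slipped into the data.
text = "Tag:%d\0%s" % (1, "\u20ac")

try:
    text.encode('latin-1')
except UnicodeEncodeError as exc:
    failed = True
    print("cannot encode:", exc.reason)
else:
    failed = False
```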

-- Steven


