[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

Steven D'Aprano steve at pearwood.info
Sat Jan 11 06:36:42 CET 2014


On Fri, Jan 10, 2014 at 06:17:02PM +0100, Juraj Sukop wrote:

> As you may know, PDF operates over bytes and an integer or floating-point number is written down as-is, for example "100" or "1.23".

I'm sorry, I don't understand what you mean here. I'm honestly not trying to be difficult: you sound confident that you understand what you are doing, but your description doesn't make sense to me. To me, it looks like you are conflating bytes and ASCII characters, that is, assuming that characters "are" in some sense identical to their ASCII representation. Let me explain:

The integer that in English is written as 100 is represented in memory as bytes 0x0064 (assuming a big-endian C short), so when you say "an integer is written down AS-IS" (emphasis added), to me that says that the PDF file includes the bytes 0x0064. But then you go on to write the three-character string "100", which (assuming ASCII) is the bytes 0x313030. Going from the C short to the ASCII representation 0x313030 is nothing like inserting the int "as-is". To put it another way, the Python 2 '%d' format code does not just copy bytes.

I think that what you are trying to say is that a PDF file is a binary file which includes some ASCII-formatted text fields. So when writing an integer 100, rather than writing it "as is", which would be byte 0x64 (with however many leading null bytes needed for padding), it is converted to the ASCII representation 0x313030 first, and that's what needs to be inserted.
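To make the distinction concrete, here is a minimal sketch using only the standard library. The variable names are illustrative, and the two-byte big-endian layout is the same assumption as the C short above.

```python
import struct

n = 100

# "As-is": the value packed as a big-endian 16-bit integer (a C short).
raw = struct.pack(">h", n)

# What the PDF needs: the decimal digits of the number, encoded as ASCII.
ascii_digits = str(n).encode("ascii")

print(raw.hex())           # 0064
print(ascii_digits.hex())  # 313030
```

The two byte strings share nothing but the number they stand for, which is why '%d'-style formatting is a conversion and not a copy.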

If you consider PDF as binary with occasional pieces of ASCII text, then working with bytes makes sense. But I wonder whether it might be better to consider PDF as mostly text with some binary bytes. Even though the bulk of the PDF will be binary, the interesting bits are text. E.g. your example:

> In the case of PDF, the embedding of an image into PDF looks like:
>
>     10 0 obj
>       << /Type /XObject
>          /Width 100
>          /Height 100
>          /Alternates 15 0 R
>          /Length 2167
>       >>
>     stream
>     ...binary image data...
>     endstream
>     endobj

Even though the binary image data is probably much, much larger in length than the text shown above, it's (probably) trivial to deal with: convert your image data into bytes, decode those bytes into Latin-1, then concatenate the Latin-1 string into the text above.
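A rough sketch of that approach, assuming a made-up four-byte payload in place of real image data. The names `image_bytes`, `header` and `obj` are illustrative, and the dictionary is abbreviated from the example above.

```python
# Stand-in for real image data: arbitrary bytes, including values above 127.
image_bytes = bytes([0x00, 0x7F, 0x80, 0xFF])

# Latin-1 maps each byte to the code point with the same value,
# so this decode cannot fail for any input.
image_text = image_bytes.decode("latin-1")

# The textual part of the object, built with ordinary string formatting.
header = (
    "10 0 obj\n"
    "<< /Type /XObject /Width 100 /Height 100 /Length %d >>\n"
    "stream\n"
) % len(image_bytes)

# Text fields and "binary" payload live together in one str.
obj = header + image_text + "\nendstream\nendobj\n"

# Encoding back to Latin-1 restores the original bytes unchanged.
encoded = obj.encode("latin-1")
assert image_bytes in encoded
```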

Latin-1 has the nice property that every byte decodes into the character with the same code point, and vice versa. So:

    for i in range(256):
        assert bytes([i]).decode('latin-1') == chr(i)
        assert chr(i).encode('latin-1') == bytes([i])

passes. It seems to me that your problem goes away if you use Unicode text with embedded binary data, rather than binary data with embedded ASCII text. Then when writing the file to disk, of course you encode it to Latin-1, either explicitly:

    pdf = ...  # Unicode string containing the PDF contents
    with open("outfile.pdf", "wb") as f:
        f.write(pdf.encode("latin-1"))

or implicitly:

    with open("outfile.pdf", "w", encoding="latin-1") as f:
        f.write(pdf)
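One thing worth checking with the implicit form: in text mode, '\n' characters are translated on write according to the `newline` argument, and the default uses the platform line separator, so on Windows a 0x0A byte in the image data would come out as 0x0D 0x0A. Passing `newline=""` turns the translation off. The sketch below uses `io.TextIOWrapper` over an in-memory buffer so the effect shows up on any platform; `write_via_text_layer` is a hypothetical helper, not part of any library.

```python
import io

def write_via_text_layer(text, newline):
    """Write text through a Latin-1 text wrapper and return the bytes produced."""
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding="latin-1", newline=newline)
    wrapper.write(text)
    wrapper.flush()
    data = raw.getvalue()
    wrapper.detach()  # leave the BytesIO alone when the wrapper is discarded
    return data

payload = bytes(range(256))       # every byte value, including 0x0A
text = payload.decode("latin-1")

untranslated = write_via_text_layer(text, newline="")     # no newline translation
translated = write_via_text_layer(text, newline="\r\n")   # mimics the Windows default

assert untranslated == payload
assert translated != payload      # the 0x0A byte was expanded to 0x0D 0x0A
```

The same `newline=""` argument is accepted by the built-in `open`.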

There may be a few wrinkles I haven't thought of; I don't claim to be an expert on PDF. But I see no reason why PDF files ought to be an exception to the rule:

* work internally with Unicode text;

* convert to and from bytes only on input and output.

Please also take note that in Python 3.3 and later, the internal representation of Unicode strings containing only code points up to 255 (i.e. pure ASCII or pure Latin-1) is very efficient, using only one byte per character.
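This can be observed on CPython with `sys.getsizeof`, although that is an implementation detail and not a language guarantee. A small sketch, with `bytes_per_extra_char` as an illustrative helper name:

```python
import sys

def bytes_per_extra_char(ch, n=1000):
    """Estimate per-character storage by comparing strings of length n and 2n."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

latin1_cost = bytes_per_extra_char("\xff")    # highest Latin-1 code point
wider_cost = bytes_per_extra_char("\u0100")   # first code point beyond Latin-1

print(latin1_cost, wider_cost)
assert latin1_cost < wider_cost
```

Subtracting the two sizes cancels the fixed per-object overhead, so only the per-character cost remains.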

Another advantage is that using text rather than bytes means that your example:

[...]

> dropping the bytes-formatting of numbers makes it more complicated than it was. I would appreciate any explanation on how:
>
>     b'%.1f %.1f %.1f RG' % (r, g, b)

becomes simply

    '%.1f %.1f %.1f RG' % (r, g, b)

in Python 3. In Python 3.3 and above, it can be written as:

    u'%.1f %.1f %.1f RG' % (r, g, b)

which conveniently is exactly the same syntax you would use in Python 2. That's much nicer than your suggestion:

> is more confusing than:
>
>     b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'), (r, g, b)))
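For comparison, a sketch showing that formatting as text and encoding once at the end produces the same bytes as encoding each number separately, as the quoted expression does. The colour values are arbitrary.

```python
r, g, b = 1.0, 0.5, 0.0

# Format as text, then encode once at the output boundary.
simple = ('%.1f %.1f %.1f RG' % (r, g, b)).encode('ascii')

# The quoted alternative: encode each number separately, then assemble bytes.
# Joining is used here so the sketch also runs on Python 3.3 and 3.4,
# where bytes has no % operator.
pieces = [('%.1f' % x).encode('ascii') for x in (r, g, b)]
verbose = b' '.join(pieces) + b' RG'

assert simple == verbose == b'1.0 0.5 0.0 RG'
```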

-- Steven


