On Sat, Jan 11, 2014 at 6:36 AM, Steven D'Aprano <steve@pearwood.info> wrote:

I'm sorry, I don't understand what you mean here. I'm honestly not
trying to be difficult, but you sound confident that you understand what
you are doing, but your description doesn't make sense to me. To me, it
looks like you are conflating bytes and ASCII characters, that is,
assuming that characters "are" in some sense identical to their ASCII
representation. Let me explain:

The integer that in English is written as 100 is represented in memory
as bytes 0x0064 (assuming a big-endian C short), so when you say "an
integer is written down AS-IS" (emphasis added), to me that says that
the PDF file includes the bytes 0x0064. But then you go on to write the
three character string "100", which (assuming ASCII) is the bytes
0x313030. Going from the C short to the ASCII representation 0x313030 is
nothing like inserting the int "as-is". To put it another way, the
Python 2 '%d' format code does not just copy bytes.
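
The distinction drawn above can be checked in a few lines. This is only a sketch: it assumes Python 3.5 or later (for bytes.hex()) and uses the struct module to stand in for the big-endian C short mentioned in the text. The variable names are illustrative.

```python
import struct

value = 100

# Raw in-memory form: a big-endian 16-bit C short.
raw = struct.pack(">h", value)

# Textual form: the ASCII digits "1", "0", "0".
text = ("%d" % value).encode("ascii")

print(raw.hex())   # 0064
print(text.hex())  # 313030
```

The two byte sequences differ in both length and content, which is why formatting with '%d' is a conversion rather than a copy.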

Sorry, I should've included an example: when I said "as-is" I meant the characters "1", "0", "0", so that would be your 0x313030.

If you consider PDF as binary with occasional pieces of ASCII text, then
working with bytes makes sense. But I wonder whether it might be better
to consider PDF as mostly text with some binary bytes. Even though the
bulk of the PDF will be binary, the interesting bits are text. E.g. your
example:

Even though the binary image data is probably much, much larger in
length than the text shown above, it's (probably) trivial to deal with:
convert your image data into bytes, decode those bytes into Latin-1,
then concatenate the Latin-1 string into the text above.
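
A minimal sketch of the round trip described above, assuming Python 3. The names image_data, header and footer are hypothetical placeholders, and the surrounding text is reduced to the PDF stream keywords purely for illustration. It works because Latin-1 maps each of the 256 byte values to the code point with the same number, so decoding and re-encoding loses nothing.

```python
# Hypothetical stand-in for binary image data: every possible byte value.
image_data = bytes(range(256))

# Hypothetical text portions surrounding the binary payload.
header = "stream\n"
footer = "\nendstream\n"

# Decode the binary payload as Latin-1 and splice it into the text.
document = header + image_data.decode("latin-1") + footer

# Encoding the whole string back to Latin-1 recovers the original bytes.
output = document.encode("latin-1")

assert output == header.encode("latin-1") + image_data + footer.encode("latin-1")
```

Note that this only holds while every character in the text part is itself within Latin-1; a code point above 255 would make the final encode() raise UnicodeEncodeError.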

This is similar to what Chris Barker suggested. I'm also not trying to be difficult here, but please explain one thing to me. Treating bytes as if they were Latin-1 is a bad idea; that's why "%f" got dropped in the first place, right? How is it then all right to put an image inside a Unicode string?

Also, apart from the in/out conversions, do any other difficulties come to mind?

Please also take note that in Python 3.3 and better, the internal
representation of Unicode strings containing only code points up to 255
(i.e. pure ASCII or pure Latin-1) is very efficient, using only one byte
per character.

I guess you meant [C]Python...
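
For what it's worth, the per-character storage cost can be observed directly. This is a sketch that assumes CPython 3.3 or later; other implementations may store strings differently, and sys.getsizeof() is itself implementation-specific. The helper bytes_per_char is a hypothetical name introduced only for this illustration.

```python
import sys

def bytes_per_char(ch, n=1000):
    # Compare two strings of different lengths so that the fixed
    # per-object overhead cancels out, leaving the cost per character.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

print(bytes_per_char("a"))       # ASCII: 1.0 on CPython 3.3+
print(bytes_per_char("\xe9"))    # Latin-1 (code point 233): 1.0
print(bytes_per_char("\u20ac"))  # beyond Latin-1 (euro sign): 2.0
```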

In any case, thanks for the detailed reply.