[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

Steven D'Aprano steve at pearwood.info
Sun Jan 12 18:22:21 CET 2014


On Sun, Jan 12, 2014 at 12:52:18PM +0100, Juraj Sukop wrote:

On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano <steve at pearwood.info> wrote:

> On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
>
> > AFAIK (and just for the record), there could be both Latin1 text and UTF-16
> > in a PDF (and other encodings too), depending on the font used:
> [...]
> > In Python2, txt is just a str, but in Python3 handling everything as latin1
> > string obviously doesn't work for TTF in this case.
>
> Nobody is suggesting that you use Latin-1 for everything. We're
> suggesting that you use it for blobs of binary data that represent
> arbitrary bytes. First you have to get your binary data in the first
> place, using whatever technique is necessary.

Just to check I understood what you are saying. Instead of writing:

content = b'\n'.join([
    b'header',
    b'part 2 %.3f' % number,
    binary_image_data,
    utf16_string.encode('utf-16be'),
    b'trailer'])

Which doesn't work, since bytes don't support %f in Python 3.

it should now look like:

content = '\n'.join([
    'header',
    'part 2 %.3f' % number,
    binary_image_data.decode('latin-1'),
    utf16_string.encode('utf-16be').decode('latin-1'),
    'trailer']).encode('latin-1')

Correct?

Not quite as you show.

First, "utf16_string" confuses me. What is it? If it is a Unicode string, i.e.:

# Python 3 semantics

type(utf16_string) => returns str

then the name is horribly misleading, and it is best handled like this:

content = '\n'.join([
    'header',
    'part 2 %.3f' % number,
    binary_image_data.decode('latin-1'),
    utf16_string,  # Misleading name, actually Unicode string
    'trailer'])

Note that since it's text, and content is text, there is no need to encode then decode.

"UTF-16" is not another name for "Unicode". Unicode is a character set. UTF-16 is just one of a number of different encodings which map the 0x10FFFF distinct Unicode characters (actually "code points") to bytes. UTF-16 is one possible way to implement Unicode strings in memory, but not the only way. Python has, or does, use four distinct implementations:

  1. UTF-16 in "narrow builds"

  2. UTF-32 in "wide builds"

  3. a hybrid approach starting in Python 3.3, where strings are stored as either:

    3a) Latin-1
    3b) UCS-2
    3c) UTF-32

    depending on the content of the string.

So calling an arbitrary string "utf16_string" is misleading or wrong.
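
To make the distinction between a Unicode string and its encodings concrete, here is a small sketch. The sample text is made up for illustration: it shows one str object producing three different byte sequences depending on the encoding chosen, which is why a name like "utf16_string" only makes sense for the encoded bytes, not for the text itself.

```python
# Hypothetical sample text: two ASCII letters plus the euro sign (U+20AC).
text = "ab\u20ac"

# The same three code points, encoded three different ways.
utf8_bytes = text.encode("utf-8")       # 1 + 1 + 3 bytes
utf16_bytes = text.encode("utf-16-be")  # 2 bytes per code point here
utf32_bytes = text.encode("utf-32-be")  # always 4 bytes per code point

print(type(text).__name__, len(text))
print(len(utf8_bytes), len(utf16_bytes), len(utf32_bytes))

# Every encoding decodes back to the same str; the text itself has no encoding.
decoded = [
    utf8_bytes.decode("utf-8"),
    utf16_bytes.decode("utf-16-be"),
    utf32_bytes.decode("utf-32-be"),
]
print(all(d == text for d in decoded))
```

The explicit big-endian codec names are used so that no byte order mark is prepended, which keeps the byte counts easy to read.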

On the other hand, if it is actually a bytes object which is the product of UTF-16 encoding, i.e.:

type(utf16_string) => returns bytes

and those bytes were generated by "some text".encode("utf-16"), then it is already binary data and needs to be smuggled into the text string. Latin-1 is good for that:

content = '\n'.join([
    'header',
    'part 2 %.3f' % number,
    binary_image_data.decode('latin-1'),
    utf16_string.decode('latin-1'),
    'trailer'])
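
As a quick check of why Latin-1 is suitable for smuggling, the following sketch (sample data invented for illustration) decodes every possible byte value with Latin-1 and encodes it back. Each byte N maps to the code point N, so the round trip through Latin-1 on both sides returns the original bytes unchanged.

```python
# Every possible byte value, standing in for an arbitrary binary blob.
all_bytes = bytes(range(256))

# Decoding with Latin-1 never fails: byte N becomes code point N.
smuggled = all_bytes.decode("latin-1")

# Encoding with Latin-1 again recovers the original bytes.
restored = smuggled.encode("latin-1")

print(len(smuggled))
print([ord(c) for c in smuggled] == list(range(256)))
print(restored == all_bytes)
```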

Both examples assume that you intend to do further processing of content before sending it, and will encode just before sending:

content.encode('utf-8')

(Don't use Latin-1, since it cannot handle the full range of text characters.)

If that's not the case, then perhaps this is better suited to what you are doing:

content = b'\n'.join([
    b'header',
    ('part 2 %.3f' % number).encode('ascii'),
    binary_image_data,  # already bytes
    utf16_string,  # already bytes
    b'trailer'])
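
For anyone who wants to run the all-bytes variant above, here is a self-contained version. The values of number, binary_image_data and utf16_string are placeholders chosen purely for illustration; only the structure of the join comes from the example.

```python
# Placeholder inputs, assumed for this sketch only.
number = 3.14159
binary_image_data = bytes([0x89, 0x50, 0xFF])   # stand-in for real image bytes
utf16_string = "hi".encode("utf-16-be")         # bytes produced by UTF-16 encoding

content = b'\n'.join([
    b'header',
    # Format as text first, then encode: the result is pure ASCII.
    ('part 2 %.3f' % number).encode('ascii'),
    binary_image_data,  # already bytes
    utf16_string,       # already bytes
    b'trailer'])

print(content)
```

Only the formatted number passes through str here; everything else stays as bytes from start to finish, so no Latin-1 smuggling is needed.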

-- Steven


