[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

Steven D'Aprano steve at pearwood.info
Mon Jan 13 00:43:55 CET 2014


On Mon, Jan 13, 2014 at 07:31:16AM +0900, Stephen J. Turnbull wrote:

Steven D'Aprano writes:

> then the name is horribly misleading, and it is best handled like this:
>
>     content = '\n'.join([
>         'header',
>         'part 2 %.3f' % number,
>         binaryimagedata.decode('latin-1'),
>         utf16string,  # Misleading name, actually Unicode string
>         'trailer'])

This loses bigtime, as any encoding that can handle non-latin1 in utf16string will corrupt binaryimagedata. OTOH, latin1 will raise on non-latin1 characters. utf16string must be encoded appropriately then decoded by latin1 to be reencoded by latin1 on output.
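For readers following along, a minimal sketch of both failure modes described above. The sample bytes and the names `binary_image_data` and `text` are made up for illustration and are not from the thread:

```python
# Hypothetical stand-in for image data: includes bytes above 127.
binary_image_data = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF])
# Hypothetical text containing U+20AC, which Latin-1 cannot represent.
text = "price: \u20ac5"

content = "\n".join([
    "header",
    binary_image_data.decode("latin-1"),  # bytes smuggled in as code points 0-255
    text,
    "trailer",
])

# Failure mode 1: Latin-1 raises as soon as the text leaves its range.
try:
    content.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Failure mode 2: UTF-8 accepts everything, but the smuggled bytes
# above 127 are no longer present verbatim in the output.
utf8_bytes = content.encode("utf-8")
raw_survives = binary_image_data in utf8_bytes

print(latin1_ok, raw_survives)
```

Under these assumptions neither encoding yields output in which the text is intact and the image bytes appear unchanged.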

Of course you're right, but I have understood the above as being a sketch and not real code. (E.g. does "header" really mean the literal string "header", or does it stand in for something which is a header?) In real code, one would need to have some way of telling where the binary image data ends and the Unicode string begins.

If I have misunderstood the situation, then my apologies for compounding the error.

[...]

> Both examples assume that you intend to do further processing of content
> before sending it, and will encode just before sending:
>
>     content.encode('utf-8')
>
> (Don't use Latin-1, since it cannot handle the full range of text
> characters.)

This corrupts binaryimagedata. Each byte > 127 will be replaced by two bytes.

And reading it back using decode('utf-8') will replace those two bytes with a single byte, round-tripping exactly.

Of course if you encode to UTF-8 and then try to read the binary data as raw bytes, you'll get corrupted data. But do people expect to do this? That's a genuine question -- again, I assumed (apparently wrongly) that the idea was to write the content out as text containing smuggled bytes, and read it back the same way.
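A small sketch of the two readings being contrasted here, assuming the data was smuggled in with `decode('latin-1')` as in the earlier example. The variable names are illustrative only:

```python
# Every possible byte value, as a hypothetical stand-in for binary data.
binary_image_data = bytes(range(256))

# Smuggle the bytes into a str, then encode the str for output.
smuggled = binary_image_data.decode("latin-1")
wire = smuggled.encode("utf-8")

# Each of the 128 bytes above 127 is now two bytes on the wire.
wire_length = len(wire)

# Reading it back the same way (as text, then Latin-1) recovers the bytes.
recovered = wire.decode("utf-8").encode("latin-1")
round_trips = recovered == binary_image_data

# Treating the wire data directly as the raw bytes does not.
raw_matches = wire == binary_image_data

print(wire_length, round_trips, raw_matches)
```

Strictly, the UTF-8 decode turns each two-byte sequence back into one code point, and it is the final Latin-1 encode that turns that code point back into a single byte.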

In the second case, you can use latin1 to encode, if it gives you what you want.

This kind of subtlety is precisely why MAL warned about use of latin1 to smuggle bytes.

How would you smuggle a chunk of arbitrary bytes into a text string? Short of doing something like uuencoding it into ASCII, or equivalent.
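As one possible reading of "or equivalent", here is a sketch using base64 from the standard library in place of uuencoding. This is an illustration, not something proposed in the thread, and the framing with `header` and `trailer` lines simply mirrors the earlier example:

```python
import base64

# Hypothetical stand-in for arbitrary binary data.
binary_image_data = bytes(range(256))

# base64 output is pure ASCII, so it is safe inside any text string.
ascii_payload = base64.b64encode(binary_image_data).decode("ascii")

content = "\n".join([
    "header",
    ascii_payload,
    "text with non-Latin-1 characters: \u20ac",
    "trailer",
])

# Any encoding that covers the text works; the payload is unaffected.
wire = content.encode("utf-8")

# The reader decodes the text, then decodes the payload line.
lines = wire.decode("utf-8").split("\n")
recovered = base64.b64decode(lines[1])

print(recovered == binary_image_data)
```

The cost is size: base64 expands the payload by roughly a third, and the reader must know which part of the text holds the payload.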

-- Steven


