[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
Steven D'Aprano steve at pearwood.info
Sun Jan 12 18:22:21 CET 2014
On Sun, Jan 12, 2014 at 12:52:18PM +0100, Juraj Sukop wrote:
> On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano <steve at pearwood.info> wrote:
>
> > On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
> >
> > > AFAIK (and just for the record), there could be both Latin1 text and UTF-16
> > > in a PDF (and other encodings too), depending on the font used:
> > [...]
> > > In Python2, txt is just a str, but in Python3 handling everything as latin1
> > > string obviously doesn't work for TTF in this case.
> >
> > Nobody is suggesting that you use Latin-1 for everything. We're
> > suggesting that you use it for blobs of binary data that represent
> > arbitrary bytes. First you have to get your binary data in the first
> > place, using whatever technique is necessary.
>
> Just to check I understood what you are saying. Instead of writing:
>
>     content = b'\n'.join([
>         b'header',
>         b'part 2 %.3f' % number,
>         binary_image_data,
>         utf16_string.encode('utf-16be'),
>         b'trailer'])

Which doesn't work, since bytes don't support %f in Python 3.

> it should now look like:
>
>     content = '\n'.join([
>         'header',
>         'part 2 %.3f' % number,
>         binary_image_data.decode('latin-1'),
>         utf16_string.encode('utf-16be').decode('latin-1'),
>         'trailer']).encode('latin-1')
>
> Correct?
Not quite as you show.
First, "utf16_string" confuses me. What is it? If it is a Unicode string, i.e.:
    # Python 3 semantics
    type(utf16_string)  => returns str
then the name is horribly misleading, and it is best handled like this:
    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string,  # Misleading name, actually Unicode string
        'trailer'])
Note that since it's text, and content is text, there is no need to encode then decode.

"UTF-16" is not another name for "Unicode". Unicode is a character set. UTF-16 is just one of a number of different encodings which map the 0x110000 distinct Unicode characters (actually "code points", U+0000 through U+10FFFF) to bytes. UTF-16 is one possible way to implement Unicode strings in memory, but not the only way. Python uses, or has used, four distinct implementations:

1) UTF-16 in "narrow builds"

2) UTF-32 in "wide builds"

3) a hybrid approach starting in Python 3.3, where strings are stored as either:

   3a) Latin-1
   3b) UCS-2
   3c) UTF-32

   depending on the content of the string.
So calling an arbitrary string "utf16_string" is misleading or wrong.
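To make the distinction between the character set and its encodings concrete, here is a small sketch. The sample character and the variable names are my own choices for illustration, not anything from the thread. It encodes one code point four ways and checks that each byte sequence decodes back to the same string:

```python
# One code point, several encodings: the bytes differ, the text does not.
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE (sample value, chosen arbitrarily)

encoded = {
    "latin-1": text.encode("latin-1"),
    "utf-8": text.encode("utf-8"),
    "utf-16-be": text.encode("utf-16-be"),
    "utf-32-be": text.encode("utf-32-be"),
}

for name, data in encoded.items():
    # Each encoding yields a different byte sequence and length...
    print(name, data, len(data))
    # ...but every one of them decodes back to the same Unicode string.
    assert data.decode(name) == text
```

The str object itself carries no encoding; an encoding only enters the picture once you call .encode() to produce bytes.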
On the other hand, if it is actually a bytes object which is the product of UTF-16 encoding, i.e.:
    type(utf16_string)  => returns bytes
and those bytes were generated by "some text".encode("utf-16"), then it is already binary data and needs to be smuggled into the text string. Latin-1 is good for that:
    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string.decode('latin-1'),
        'trailer'])
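The reason Latin-1 works for smuggling is that it maps each of the 256 possible byte values to the code point with the same number. A minimal sketch of that property follows; the variable names are hypothetical and chosen only for this illustration:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as Latin-1 never fails: byte N becomes code point U+00NN.
as_text = all_bytes.decode("latin-1")
assert len(as_text) == 256
assert [ord(c) for c in as_text] == list(range(256))

# Encoding back with the same codec recovers the original bytes exactly.
round_tripped = as_text.encode("latin-1")

# For comparison only: a different codec on the way out gives different
# bytes, because UTF-8 uses two bytes for each code point above U+007F.
utf8_out = as_text.encode("utf-8")
```

Here round_tripped equals all_bytes, while utf8_out does not, so the smuggled parts only come back as the original bytes if they are encoded with the codec that was used to decode them.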
Both examples assume that you intend to do further processing of content before sending it, and will encode just before sending:
    content.encode('utf-8')
(Don't use Latin-1, since it cannot handle the full range of text characters.)
If that's not the case, then perhaps this is better suited to what you are doing:
    content = b'\n'.join([
        b'header',
        ('part 2 %.3f' % number).encode('ascii'),
        binary_image_data,  # already bytes
        utf16_string,  # already bytes
        b'trailer'])
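For anyone who wants to run that last version, here is a self-contained sketch. The values of number, binary_image_data and utf16_string are made-up stand-ins, since the thread never gives real ones:

```python
# Hypothetical sample inputs, for illustration only.
number = 3.14159
binary_image_data = bytes([0x89, 0x00, 0xFF])   # arbitrary binary blob
utf16_string = "hi".encode("utf-16be")          # bytes, despite the name

# Format the number as text, then encode just that piece to ASCII bytes;
# everything else is already bytes and is joined as-is.
content = b"\n".join([
    b"header",
    ("part 2 %.3f" % number).encode("ascii"),
    binary_image_data,
    utf16_string,
    b"trailer",
])

print(content)
```

Only the formatted number passes through str here, so the binary parts are never decoded at all.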
-- Steven