

On 12 Jan 2014 21:53, "Juraj Sukop" <juraj.sukop@gmail.com> wrote:
>
>
>
>
> On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano <steve@pearwood.info> wrote:
>>
>> On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
>>
>> > AFAIK (and just for the record), there could be both Latin1 text and UTF-16
>> > in a PDF (and other encodings too), depending on the font used:
>> [...]
>> > In Python2, txt is just a str, but in Python3 handling everything as latin1
>> > string obviously doesn't work for TTF in this case.
>>
>> Nobody is suggesting that you use Latin-1 for *everything*. We're
>> suggesting that you use it for blobs of binary data that represent
>> arbitrary bytes. First you have to get your binary data in the first
>> place, using whatever technique is necessary.
>
>
> Just to check I understood what you are saying. Instead of writing:
>
>     content = b'\n'.join([
>         b'header',
>         b'part 2 %.3f' % number,
>         binary_image_data,
>         utf16_string.encode('utf-16be'),
>         b'trailer'])
>
> it should now look like:
>
>     content = '\n'.join([
>         'header',
>         'part 2 %.3f' % number,
>         binary_image_data.decode('latin-1'),
>         utf16_string.encode('utf-16be').decode('latin-1'),
>         'trailer']).encode('latin-1')

Why are you proposing to do the *join* in text space? Encode all the parts separately, then concatenate them with b'\n'.join() (or whatever separator is appropriate). It's only the *text formatting operation* that needs to be done in text space and then explicitly encoded (and this example doesn't even need latin-1; ASCII is sufficient):

    content = b'\n'.join([
        b'header',
        ('part 2 %.3f' % number).encode('ascii'),
        binary_image_data,
        utf16_string.encode('utf-16be'),
        b'trailer'])

> Correct?

My updated version above is the reasonable way to do it in Python 3, and the one I consider clearly superior to reintroducing implicit encoding to ASCII as part of the core text model.
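
For concreteness, that approach runs end to end once the inputs are supplied; the values of `number`, `binary_image_data`, and `utf16_string` below are made-up placeholders for illustration, not part of the original discussion:

```python
# Sketch of the bytes-join approach above, with assumed sample values.
number = 3.14159
binary_image_data = b'\x89PNG\r\n\x00\xff'  # stand-in for real image bytes
utf16_string = 'caf\u00e9'                  # text destined for UTF-16-BE output

content = b'\n'.join([
    b'header',
    ('part 2 %.3f' % number).encode('ascii'),  # format in text space, encode once
    binary_image_data,                          # already bytes: joined untouched
    utf16_string.encode('utf-16be'),
    b'trailer'])
```

Note that only the formatted part ever passes through str; the image bytes are never decoded and re-encoded.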

This is why I *don't* have a problem with PEP 460 as it stands - it's just syntactic sugar for something you can already do with b''.join(), and thus not particularly controversial.

It's only proposals that add any form of implicit encoding
that silently switches from the text domain to the binary domain that conflict with the core Python 3 text model (although third party types remain largely free to do whatever they want).
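
A minimal illustration of that boundary (my own sketch, not from the thread): in Python 3, str and bytes never mix implicitly, so crossing from text to binary always requires an explicit encode.

```python
# Python 3 refuses to concatenate str and bytes without an explicit encode.
try:
    b'header' + 'trailer'  # no implicit encoding: raises TypeError
except TypeError:
    crossed_implicitly = False
else:
    crossed_implicitly = True

# The text/binary boundary must be crossed explicitly:
combined = b'header' + 'trailer'.encode('ascii')
```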

Cheers,
Nick.

>
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
>