[Python-Dev] PEP 460: allowing %d and %f and mojibake

Scott Dial scott+python-dev at scottdial.com
Mon Jan 13 04:26:24 CET 2014


On 2014-01-11 22:09, Nick Coghlan wrote:

> For Python 2 folks trying to grok where the "bright line" is in terms of the Python 3 text model: if your proposal includes any kind of implicit serialisation of non-binary data to binary, it is going to be rejected as an addition to the core bytes type. If it avoids crossing that line (as the buffer-API-only version of PEP 460 does), then we can talk.

To take such a hard-line stance, I would expect you to author a PEP to strip the ASCII conveniences from the bytes and bytearray types. Otherwise, I find it a bit schizophrenic to argue that methods like lower, upper, title, etc. don't implicitly assume an encoding:

>>> a = "scott".encode('utf-16')
>>> b = a.title()
>>> c = b.decode('utf-16')
>>> c
'SCOTT'

So, clearly title() not only depends on the bytes being encoded in a superset of ASCII, it depends on the bytes being a sequence of ASCII characters, which looks an awful lot like an operation on an implicitly encoded string.
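The behaviour in the title() example above can be reproduced with a short script. This is only a sketch: it uses the 'utf-16-le' codec (my choice, not part of the original session) so that the byte order is fixed and no BOM is emitted, and it adds a UTF-8 round trip for comparison.

```python
# Sketch: bytes.title() treats every ASCII-letter byte as a character,
# regardless of the encoding the bytes were meant to carry.
text = "scott"

# UTF-16-LE puts a NUL byte after each ASCII letter, so every letter
# looks like the start of a new "word" to bytes.title().
utf16 = text.encode("utf-16-le")
result_utf16 = utf16.title().decode("utf-16-le")

# UTF-8 leaves ASCII letters adjacent, so only the first is uppercased.
utf8 = text.encode("utf-8")
result_utf8 = utf8.title().decode("utf-8")

print(result_utf16)  # SCOTT
print(result_utf8)   # Scott
```

The same method call gives a different textual result depending on which encoding produced the bytes, which is the dependency being pointed out.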

>>> b"文字化け"
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.

There is an implicit serialization right there. If my terminal is UTF-8 (or my source encoding is UTF-8), why would that not be:

b'\xe6\x96\x87\xe5\xad\x97\xe5\x8c\x96\xe3\x81\x91'
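For reference, the bytes shown above are what an explicit UTF-8 encode of the same text produces. A minimal sketch of both halves of the point follows; the use of compile() to trigger the literal error, and the "<example>" filename, are just illustrative choices.

```python
# Sketch: explicit encoding yields the byte string from the message,
# while the non-ASCII bytes literal is rejected by the compiler.
text = "文字化け"
encoded = text.encode("utf-8")
expected = b'\xe6\x96\x87\xe5\xad\x97\xe5\x8c\x96\xe3\x81\x91'
print(encoded == expected)  # True

# Compile the literal from a string so the SyntaxError can be caught
# at runtime instead of aborting this script.
try:
    compile('b"文字化け"', "<example>", "eval")
    literal_rejected = False
except SyntaxError:
    literal_rejected = True
print(literal_rejected)  # True
```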

I sympathize with Ethan that the bytes and bytearray types already seem to concede that bytes is the type you want to use for 7-bit ASCII manipulations. If that is not what we want, then we are not doing a good job communicating that to developers with the API. At the outset, the bytes literal itself seems to be an attractive nuisance, as it gives a nod to using bytes for ASCII character sequences (a.k.a. ASCII strings).

Regards,
-Scott

-- Scott Dial scott at scottdial.com
