[Python-Dev] PEP 460 reboot

Stephen J. Turnbull stephen at xemacs.org
Thu Jan 16 05:39:30 CET 2014


Nick Coghlan writes:

> Yes, I'm currently thinking the appropriate approach to the docs will be to remove the current "these have most of the str methods too" paragraph for binary sequences and instead create three completely explicit lists of methods:

I'm not sure what that means. If you mean that in the format string for .format() and %-formatting, bytes 0-127 must always have ASCII coded character semantics with bytes 128-255 unrestricted, indeed, that is the pragmatic restriction. Is there anything else?

The implications of this should be made clear, though: funky Asian encodings cannot be safely used in format strings for .format(), GB18030 isn't safe in %-formatting either, and the value returned by these operations should be assumed to be non-ASCII-compatible unless proven otherwise (so no iterated formatting).
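A minimal sketch of the .format() half of that hazard, assuming Python 3 and its standard "gbk" codec. The helper name chars_containing_byte is made up for illustration. It looks for CJK characters whose encoded form contains the byte for "{", which a bytes format string would treat as the start of a replacement field:

```python
def chars_containing_byte(encoding, target, start=0x4E00, stop=0x9FA6):
    """Return characters in [start, stop) whose encoded form contains `target`."""
    hits = []
    for cp in range(start, stop):
        ch = chr(cp)
        try:
            encoded = ch.encode(encoding)
        except UnicodeEncodeError:
            continue  # character is not representable in this encoding
        if target in encoded:
            hits.append(ch)
    return hits


# GBK trail bytes overlap the ASCII range, so some characters contain 0x7B ("{").
gbk_hits = chars_containing_byte("gbk", b"{")
# UTF-8 never uses bytes below 0x80 inside a multi-byte sequence.
utf8_hits = chars_containing_byte("utf-8", b"{")

print(len(gbk_hits) > 0)
print(utf8_hits)
```

The same scan can be pointed at other codecs or other target bytes; it only reports what the codec produces and makes no claim about which characters occur in real data.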

I think you also need

since as far as I know the only strictly ASCII-compatible binary formats are ISO 2022-compatible encodings and UTF-8, i.e., text. The characters represented with bytes in the range 128-255 are not handled by the bytes versions of the case-checking and case-converting operations, so those operations have extremely dubious semantics unless the data is pure ASCII. This is also true of most of the is* operations.
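A small illustration of that difference, assuming Python 3 and Latin-1 encoded data (the choice of encoding and sample word is arbitrary):

```python
text = "caf\xe9"               # the word "cafe" with an acute accent on the e
data = text.encode("latin-1")  # the same word as bytes: b'caf\xe9'

# str.upper() knows that U+00E9 has an uppercase form, U+00C9.
print(text.upper())

# bytes.upper() only maps a-z to A-Z; the byte 0xE9 is left alone.
print(data.upper())

# The is* checks diverge in the same way: 0xE9 is not an ASCII letter.
print(text.isalpha())
print(data.isalpha())
```

So uppercasing the bytes and then decoding does not give the same result as decoding and then uppercasing the str.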

Note that .center and .strip have pretty dubious semantics for arbitrary "ASCII-compatible" data:

    >>> b"abc\r\n".center(15)
    b'     abc\r\n     '

    >>> " \xA0abc\xA0 ".strip()
    'abc'
    >>> b" \xA0abc\xA0 ".strip()
    b'\xa0abc\xa0'

Of course the case of .center() is purely a programmer error, and I don't have a use case where it's problematic in practice. But it's sort of unpleasant.

Although I have internalized Guido's point that what's important is that there be no implicit conversions between bytes and str, I still worry that this slew of subtle semantic differences when moving str methods wholesale to bytes is a bug magnet.

I have an especially bad feeling about str-into-bytes interpolation. If people want that, they should use a type like asciistr that provides more or less firm guarantees that the content is pure ASCII.
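As a rough sketch of what such a guarantee could look like with plain str and bytes (this is not the asciistr type itself; the helpers ascii_only and build_header are hypothetical names invented here), the text can be encoded explicitly and rejected if it is not pure ASCII:

```python
def ascii_only(text):
    """Encode `text` to bytes, refusing anything outside 7-bit ASCII."""
    # str.encode("ascii") raises UnicodeEncodeError on any non-ASCII character.
    return text.encode("ascii")


def build_header(name, value):
    """Join a header name and value into one bytes line, encoding explicitly."""
    return b": ".join([ascii_only(name), ascii_only(value)]) + b"\r\n"


print(build_header("Content-Type", "text/plain"))

try:
    build_header("Subject", "caf\xe9")
except UnicodeEncodeError as exc:
    print("rejected:", exc.reason)
```

The point of the sketch is only that the str-to-bytes step is explicit and fails loudly, rather than happening implicitly inside an interpolation.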

> PEP 461 would add a fourth category, of being provided, but with more restricted semantics.

I haven't looked closely at PEP 461 yet, and I'm not sure I'm going to have time this week.


