[Python-Dev] PEP 460 reboot

Stephen J. Turnbull stephen at xemacs.org
Thu Jan 16 05:39:30 CET 2014

Nick Coghlan writes:

> Yes, I'm currently thinking the appropriate approach to the docs will be to remove the current "these have most of the str methods too" paragraph for binary sequences and instead create three completely explicit lists of methods:
>
> provided, works with arbitrary data
>
> provided, assumes the use of an ASCII compatible data format

I'm not sure what that means. If you mean that in the format string for .format() and %-formatting, bytes 0-127 must always have ASCII coded character semantics with bytes 128-255 unrestricted, indeed, that is the pragmatic restriction. Is there anything else?

The implications of this should be made clear, though: funky Asian encodings cannot be safely used in format strings for format(), GB18030 isn't safe in %-formatting either, and the value returned by these operations should be assumed to be non-ASCII-compatible unless proven otherwise (no iterated formatting).
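The format() hazard can be made concrete with a small sketch. It assumes Python 3's standard codecs; the helper name chars_with_byte and the scanned code point range are made up for illustration. It looks for characters whose multibyte encoding contains a given ASCII byte, such as the "{" that a byte-oriented format parser would treat as the start of a replacement field.

```python
def chars_with_byte(encoding, byte, start=0x4E00, stop=0xA000):
    """Return characters in [start, stop) whose multibyte encoding
    contains `byte` (a length-1 bytes object).

    Hypothetical helper for illustration only; the default range
    covers the CJK Unified Ideographs block.
    """
    hits = []
    for code_point in range(start, stop):
        ch = chr(code_point)
        try:
            data = ch.encode(encoding)
        except UnicodeEncodeError:
            continue  # character not representable in this encoding
        if len(data) > 1 and byte in data:
            hits.append(ch)
    return hits


# Shift JIS trail bytes overlap the ASCII range, so some kanji
# contain the byte for "{" in their encoded form.
sjis_hits = chars_with_byte("shift_jis", b"{")

# UTF-8 multibyte sequences use only bytes >= 0x80, so none do.
utf8_hits = chars_with_byte("utf-8", b"{")

print(len(sjis_hits), len(utf8_hits))
```

Any character in sjis_hits would be misread if it appeared in a bytes format string, while the UTF-8 list stays empty.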

I think you also need

provided, assumes pure ASCII-encoded text

since as far as I know the only strictly ASCII-compatible binary formats are ISO 2022-compatible encodings and UTF-8, i.e., text. The bytes versions of the case-checking and case-converting operations do not handle characters represented with bytes in the range 128-255, so those operations have extremely dubious semantics unless the data is pure ASCII. This is also true of most of the is* operations.
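A short illustration of the difference, assuming Latin-1 as the example encoding (any encoding that uses bytes 128-255 for letters would show the same thing):

```python
# "cafe" with an e-acute, as str and as Latin-1 encoded bytes.
text = "caf\u00e9"
data = text.encode("latin-1")        # b'caf\xe9'

# str case conversion knows about the non-ASCII letter...
upper_text = text.upper()            # 'CAF\u00c9'

# ...but the bytes version only touches ASCII letters a-z.
upper_data = data.upper()            # b'CAF\xe9'

# The is* predicates diverge in the same way.
str_is_alpha = "\u00e9".isalpha()    # True
bytes_is_alpha = b"\xe9".isalpha()   # False

print(upper_text, upper_data, str_is_alpha, bytes_is_alpha)
```

Decoding upper_data back to text gives a string that is only partly upper-cased, which is the kind of quiet mismatch at issue here.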

Note that .center and .strip have pretty dubious semantics for arbitrary "ASCII-compatible" data:

    >>> b"abc\r\n".center(15)
    b'     abc\r\n     '

    >>> " \xA0abc\xA0 ".strip()
    'abc'
    >>> b" \xA0abc\xA0 ".strip()
    b'\xa0abc\xa0'

Of course the case of .center() is purely a programmer error, and I don't have a use case where it's problematic in practice. But it's sort of unpleasant.

Although I have internalized Guido's point that what's important is that there be no implicit conversions between bytes and str, I still worry that this slew of subtle semantic differences when moving str methods wholesale to bytes is a bug magnet.

I have an especially bad feeling about str-into-bytes interpolation. If people want that, they should use a type like asciistr that provides more or less firm guarantees that the content is pure ASCII.
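As a rough sketch of what such a guarantee could look like without a dedicated type: the function below is hypothetical (it is not the asciistr type and not part of any PEP) and assumes Python 3.5 or later, where bytes supports %-formatting. It refuses to interpolate a str unless it encodes cleanly as ASCII.

```python
def interpolate_ascii(template: bytes, text: str) -> bytes:
    """Interpolate `text` into `template` only if it is pure ASCII.

    Hypothetical helper for illustration. The explicit encode makes
    the str-to-bytes step visible and raises UnicodeEncodeError
    instead of silently producing non-ASCII output.
    """
    encoded = text.encode("ascii")   # fails loudly on non-ASCII input
    return template % (encoded,)     # bytes %-formatting, Python 3.5+


header = interpolate_ascii(b"Host: %s\r\n", "example.org")
print(header)
```

Passing a string such as "caf\u00e9" raises UnicodeEncodeError, so the caller has to decide on an encoding explicitly rather than have one chosen implicitly.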

> not provided
>
> PEP 461 would add a fourth category, of being provided, but with more restricted semantics.

I haven't looked closely at PEP 461 yet, and I'm not sure I'm going to have time this week.
