[Python-Dev] PEP 460: allowing %d and %f and mojibake

Guido van Rossum guido at python.org
Tue Jan 14 19:16:17 CET 2014


[Other readers: asciistr is at https://github.com/jeamland/asciicompat]

On Mon, Jan 13, 2014 at 11:44 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> Right, asciistr is designed for a specific kind of hybrid API where you want to accept binary input (and produce binary output) and you want to accept text input (and produce text output). Porting those from Python 2 to Python 3 is painful not because of any limitations of the str or bytes API but because it's the only use case I have found where I actually missed the implicit interoperability offered by the Python 2 str type.

Yes, the use case is clear.

> It's not an implementation style I would consider appropriate for the standard library - we need to code very defensively in order to aid debugging in arbitrary contexts, so I consider having an API like urllib.parse demand 7-bit ASCII in the binary version, and require text to handle impure input, to be a better design choice.

This surprises me. I think asciistr should strive to be useful for the stdlib as well.

> However, in an environment where you can place greater preconditions on your inputs (such as "ensure all input data is ASCII compatible")

That gives me the Python 2 willies. :-(

> and you're willing to tolerate the occasional obscure traceback for particular kinds of errors,

Really? Can you give an example where the traceback using asciistr() would be more obscure than using the technique you used in urllib.parse?

> then it should be a convenient way to use common constants (like separators or URL scheme names) in an algorithm that can manipulate either binary or text, but not a combination of the two (the latter is still a nice improvement in correctness over Python 2, which allowed them to be mixed freely rather than requiring consistency across the inputs).

Unfortunately I suspect there are still examples where asciistr's "submissive" behavior can produce surprises. E.g. consider a function of two arguments that must either be both bytes or both str. It's easily conceivable that for certain combinations of incorrect arguments (i.e. one bytes and one str) the function doesn't raise an error but returns something of one or the other type. (And this is exactly the Python 2 outcome we're trying to avoid.)
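The trap Guido describes can be sketched in pure Python. The real asciistr is a C extension; the `ToyAsciiStr` class and `first_nonempty` function below are illustrative assumptions that mimic its "submissive" concatenation, not its actual implementation:

```python
class ToyAsciiStr(str):
    """Toy stand-in for asciistr: a str subclass that also concatenates
    with bytes by encoding itself as ASCII (illustration only)."""
    def __add__(self, other):
        if isinstance(other, bytes):
            return self.encode('ascii') + other
        return str.__add__(self, other)

    def __radd__(self, other):
        if isinstance(other, bytes):
            return other + self.encode('ascii')
        return other + str(self)

SUFFIX = ToyAsciiStr('!')

def first_nonempty(a, b):
    # Contract: both arguments should be bytes, or both str.
    return (a + SUFFIX) if a else (b + SUFFIX)

# Mixed (incorrect) arguments: no error is raised, and the result type
# silently depends on which branch ran -- exactly the Python 2 outcome.
print(first_nonempty(b'x', 'y'))  # b'x!'  (the bytes argument was used)
print(first_nonempty(b'', 'y'))   # y!     (the str argument was used)
```

Neither call fails, so the type confusion only surfaces later, far from the call site.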

> It's still slightly different from Python 2, though. In Python 2, the interaction model was:
>
> str & str -> str
> str & unicode -> unicode
>
> (with the one exception being str.format: that consistently produces str rather than promoting to Unicode)

Or raises good old UnicodeError. :-(

> My goal for asciistr is that it should exhibit the following behaviour:

> str & asciistr -> str
> asciistr & asciistr -> str (making it asciistr would be a pain and I don't have a use case for that)

I almost had one in the example code I sent in response to Greg.

> bytes & asciistr -> bytes

I understand that '&' here stands for "any arbitrary combination", but what about searches? Given that asciistr's base class is str, won't it still blow up if you try to use it as an argument to e.g. bytes.startswith()? Equality tests also sound problematic; is b'x' == asciistr('x') == 'x' ???
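The mechanics behind this worry are easy to check with plain str and bytes (no asciistr needed): bytes methods reject str arguments outright, and bytes never compare equal to str in Python 3, so a str subclass inherits both behaviours unless it also implements the buffer protocol.

```python
# bytes.startswith() refuses a str argument...
try:
    b'xyz'.startswith('x')
except TypeError as exc:
    print('startswith:', exc)

# ...and bytes/str equality is always False in Python 3, so the
# three-way equality Guido asks about cannot hold transitively:
print(b'x' == 'x')   # False
print('x' == 'x')    # True
```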

> So in code like that in urllib.parse (but in a more constrained context), you could just switch all your constants to asciistr, change your indexing operations to length 1 slices and then in theory essentially the same code that worked in Python 2 should also work in Python 3.

The more I think about this, the less I believe it's that easy. I suspect you had the right idea when you mentioned singledispatch. It might be easier to write the bytes version in terms of the string versions wrapped in decode/encode, or vice versa, rather than trying to reason out all the different combinations of str, bytes, asciistr.
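A sketch of the singledispatch approach alluded to here, under the assumption that the str version carries the algorithm and the bytes version is a thin decode/re-encode wrapper (`quote_scheme` is a made-up example function, not urllib.parse code):

```python
from functools import singledispatch

@singledispatch
def quote_scheme(url):
    raise TypeError(f'expected str or bytes, got {type(url).__name__}')

@quote_scheme.register
def _(url: str) -> str:
    # The real algorithm lives here, written once, in terms of str.
    scheme, sep, rest = url.partition(':')
    return scheme.lower() + sep + rest

@quote_scheme.register
def _(url: bytes) -> bytes:
    # The bytes version just round-trips through the str version.
    return quote_scheme(url.decode('ascii')).encode('ascii')

print(quote_scheme('HTTP://host/path'))    # http://host/path
print(quote_scheme(b'HTTP://host/path'))   # b'http://host/path'
```

This sidesteps the combinatorics entirely: each input type has exactly one code path, and mixed-type inputs cannot arise.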

> However, Benno is finding that my warning about possible interoperability issues was accurate - we have various places where we do PyUnicode_Check() rather than PyUnicode_CheckExact(), which means we don't always notice a PEP 3118 buffer interface if it is provided by a str subclass.

Not sure I understand this, but I believe him when he says this won't be easy.

> We'll look at those as we find them, and either work around them (if we can), decide not to support that behaviour in asciistr, or else I'll create a patch to resolve the interoperability issue.

> It's not necessarily a type I'd recommend using in production code, as there will always be a more explicit alternative that doesn't rely on a tricksy C extension type that only works in CPython. However, it's a type I think is worth having implemented and available on PyPI, even if it's just to disprove the claim that you can't write that kind of code in Python 3.

Hm. It is beginning to sound more and more flawed. I also worry that it will bring back the nightmare of data-dependent UnicodeErrors. E.g. this (from tests/basic.py):

def test_asciistr_will_not_accept_codepoints_above_127(self):
    self.assertRaises(ValueError, asciistr, 'Schrödinger')

looks reasonable enough when you assume asciistr() is always used with a literal as argument -- but I suspect that plenty of people would misunderstand its purpose and write asciistr(s) as a "clever" way to turn a string into something that's compatible with both bytes and strings... :-(
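The failure mode is data-dependent in exactly the old way. A minimal stand-in for the constructor's check (`ascii_constant` is a hypothetical sketch mirroring the quoted test, not asciistr's code) shows how `asciistr(s)` applied to user data succeeds or fails depending on content:

```python
def ascii_constant(text):
    # Stand-in for asciistr's constructor check: reject code points > 127.
    if any(ord(ch) > 127 for ch in text):
        raise ValueError('only code points in the range 0-127 are allowed')
    return text

print(ascii_constant('scheme'))        # fine: a literal the author controls
try:
    ascii_constant('Schrödinger')      # user data: blows up only sometimes
except ValueError as exc:
    print(exc)
```

Code written this way passes every test until the first non-ASCII input arrives in production.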

--
--Guido van Rossum (python.org/~guido)
