[Python-Dev] accept string in a2b and base64?

Nick Coghlan ncoghlan at gmail.com
Tue Feb 21 03:51:08 CET 2012


On Tue, Feb 21, 2012 at 11:24 AM, R. David Murray <rdmurray at bitdance.com> wrote:

> If most people agree with Antoine I won't fight it, but it seems to me that accepting unicode in the binascii and base64 APIs is a bad idea.

I see it as essentially the same as the changes I made in urllib.parse to support pure ASCII bytes->bytes in many of the APIs (which work by doing an implicit ascii+strict decode at the beginning of the function, and then reversing that at the end). For those, if your byte sequence has non-ASCII data in it, they'll throw a UnicodeDecodeError and it's up to you to figure out where those non-ASCII bytes are coming from. Similarly, if one of these updated APIs throws ValueError, then you'll have to figure out where the non-ASCII code points are coming from.
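To make the parallel concrete, here is a small sketch of both behaviours side by side. It assumes a Python 3 version where binascii's a2b_* functions already accept ASCII-only str; the helper name describe_failures is made up for illustration.

```python
import binascii
from urllib.parse import urlsplit

# ASCII-only input works in either "mode".
ascii_netloc = urlsplit(b"http://example.com/path").netloc   # bytes in, bytes out
ascii_decoded = binascii.a2b_base64("aGVsbG8=")              # str in, bytes out


def describe_failures():
    """Hypothetical helper: record which exception each non-ASCII input raises."""
    failures = {}
    try:
        # Non-ASCII bytes fail in the implicit ascii+strict decode.
        urlsplit(b"http://example.com/caf\xc3\xa9")
    except UnicodeDecodeError as exc:
        failures["bytes to urlsplit"] = type(exc).__name__
    try:
        # Non-ASCII code points are rejected before any decoding happens.
        binascii.a2b_base64("caf\u00e9")
    except ValueError as exc:
        failures["str to a2b_base64"] = type(exc).__name__
    return failures
```

In both cases the error surfaces at the API boundary, so the caller knows exactly which call received the unexpected data.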

Yes, it's a niggling irritation from a purist point of view, but it's also an acknowledgement of the fact that whether a pure ASCII sequence should be treated as a sequence of bytes or a sequence of code points is going to be application and context dependent. Sometimes it will make more sense to treat it as binary data, other times as text.

The key point is that any multimode support that depends on implicit type conversion from bytes->str (or vice-versa) really needs to be limited to strict ASCII only (if no other information on the encoding is available). If something is 7-bit ASCII pure, then odds are very good that it really is ASCII text. As soon as that high-order bit gets set though, all bets are off and we have to push the text encoding problem back on the API caller to figure out.
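As a rough illustration of that rule, a multimode function can be sketched along these lines. The names _coerce_to_str and strip_trailing_slash are hypothetical and this is not the actual urllib.parse implementation, just the same general shape: decode strictly as ASCII on the way in, re-encode on the way out.

```python
def _coerce_to_str(value):
    """Return (text, restore), where restore() converts a result back
    to the caller's original type. Hypothetical helper for illustration."""
    if isinstance(value, str):
        return value, lambda text: text
    if isinstance(value, (bytes, bytearray)):
        # Strict ASCII only: any byte with the high-order bit set raises
        # UnicodeDecodeError instead of being guessed at.
        decoded = bytes(value).decode("ascii", "strict")
        return decoded, lambda text: text.encode("ascii", "strict")
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


def strip_trailing_slash(path):
    """Toy multimode API: str in gives str out, bytes in gives bytes out."""
    text, restore = _coerce_to_str(path)
    return restore(text.rstrip("/"))
```

With this shape, strip_trailing_slash("/a/b/") returns "/a/b", strip_trailing_slash(b"/a/b/") returns b"/a/b", and bytes containing non-ASCII data fail immediately rather than being decoded with a guessed encoding.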

The reason Python 2's implicit str<->unicode conversions are so problematic isn't just because they're implicit: it's because they effectively assume latin-1 as the encoding on the 8-bit str side. That means reliance on implicit decoding can silently corrupt non-ASCII data instead of triggering exceptions at the point of implicit conversion. If you're lucky, some other part of the application will detect the corruption and you'll have at least a vague hope of tracking it down. Otherwise, the corrupted data may escape the application and you'll have an even thornier debugging problem on your hands.
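The difference between silent corruption and an immediate exception can be simulated in Python 3 with explicit decodes. This is only an analogy for the failure mode described above, not a reproduction of Python 2's behaviour; the variable and function names are made up.

```python
raw = b"caf\xc3\xa9"  # UTF-8 encoded bytes for "café"

# A latin-1 decode can never fail, so the wrong answer passes through quietly.
mangled = raw.decode("latin-1")


def strict_decode(data):
    """Strict ASCII decode: raises UnicodeDecodeError on any high-order bit."""
    return data.decode("ascii", "strict")
```

Here mangled is the five-character string "cafÃ©", which looks plausible enough to travel a long way before anyone notices, while strict_decode(raw) fails at the point of conversion.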

My one concern with the base64 patch is that it doesn't test that mixing types triggers TypeError. While this shouldn't require any extra code (the error should arise naturally from the method implementation), it should still be tested explicitly to ensure type mismatches fail as expected. Checking explicitly for mismatches in the code would then just be a matter of wanting to emit nice error messages explaining the problem rather than being needed for correctness reasons (e.g. urlparse uses pre-checks in order to emit a clear error message for type mismatches, but it has significantly longer function signatures to deal with).
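The kind of explicit test I have in mind looks something like the sketch below. It is written against urllib.parse.urljoin, where the mixed-type pre-check already exists, rather than against the base64 patch itself; the class and function names are made up, and the equivalent base64 tests would need to use whichever multi-argument signatures the patch actually touches.

```python
import unittest
from urllib.parse import urljoin


class MixedTypeTests(unittest.TestCase):
    """Check that mixing str and bytes arguments fails loudly."""

    def test_str_base_bytes_url(self):
        with self.assertRaises(TypeError):
            urljoin("http://example.com/", b"path")

    def test_bytes_base_str_url(self):
        with self.assertRaises(TypeError):
            urljoin(b"http://example.com/", "path")

    def test_matching_types_still_work(self):
        self.assertEqual(urljoin("http://example.com/", "path"),
                         "http://example.com/path")
        self.assertEqual(urljoin(b"http://example.com/", b"path"),
                         b"http://example.com/path")


def run_mixed_type_tests():
    """Run the suite without calling sys.exit(), returning the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MixedTypeTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests like these cost almost nothing to write, and they pin down the failure behaviour even when the TypeError comes from the implementation rather than from an explicit check.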

Cheers, Nick.

-- Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia


