[Python-Dev] Why can't I encode/decode base64 without importing a module?

R. David Murray rdmurray at bitdance.com
Tue Apr 23 16:16:01 CEST 2013


On Tue, 23 Apr 2013 22:29:33 +0900, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:

R. David Murray writes:

> You transform into the encoding, and untransform out of the
> encoding. Do you have an example where that would be ambiguous?

In the bytes-to-bytes case, any pair of character encodings (eg, UTF-8 and ISO-8859-15) would do. Or how about in text, ReST to HTML?

If I write:

bytestring.transform('ISO-8859-15')

that would indeed be ambiguous, but only because I haven't named the source encoding of the bytestring. So the above is obviously nonsense, and the easiest "fix" is to have the things that are currently bytes-to-text or text-to-bytes character set transformations only work with encode/decode, and not transform/untransform.
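To illustrate the point about naming the source encoding, here is a minimal sketch of a character-set conversion done with the existing encode/decode methods. The variable names are only illustrative.

```python
# Converting between two character encodings with decode/encode.
# The source encoding has to be named explicitly in the decode step,
# so there is no ambiguity about how the input bytes are interpreted.
utf8_bytes = "é".encode("utf-8")

# bytes (UTF-8) -> text -> bytes (ISO-8859-15)
latin9_bytes = utf8_bytes.decode("utf-8").encode("iso-8859-15")
```

Both encodings are spelled out in the chain, which is exactly the information a bare bytestring.transform('ISO-8859-15') would be missing.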

BASE64 itself is ambiguous. By RFC specification, BASE64 is a textual representation of arbitrary binary data. (Cf. URIs.) The natural interpretation of .encode('base64') in that context would be as a bytes-to-text encoder. However, this has several problems. In practice, we invariably use an ASCII octet stream to carry BASE64-encoded data. So web developers would almost certainly expect a bytes-to-bytes encoder. Such a bytes-to-bytes encoder can't be duck-typed. Double-encoding bugs wouldn't be detected until the stream arrives at the user. And the RFC-based signature of .encode('base64') as bytes-to-text is precisely opposite to that of .encode('utf-8') (text-to-bytes).

I believe that after much discussion we have settled on these transformations (in their respective modules) accepting either bytes or strings as input for decoding, only bytes as input for encoding, and always producing bytes as output. (Note that the base64 docs need some clarification about this.)
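As a small sketch of that behaviour, using the base64 module's b64encode and b64decode functions in Python 3 (the variable names are illustrative only):

```python
import base64

raw = b"hello"

# Encoding: bytes in, bytes out.
encoded = base64.b64encode(raw)

# Decoding: bytes in, bytes out.
from_bytes = base64.b64decode(encoded)

# Decoding also accepts an ASCII-only str, and still returns bytes.
from_text = base64.b64decode(encoded.decode("ascii"))

# Encoding a str is rejected with a TypeError.
try:
    base64.b64encode("hello")
    str_encode_rejected = False
except TypeError:
    str_encode_rejected = True
```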

Given this, the possible valid transformations would be:

bytestring.transform('base64')
bytestring.untransform('base64')
string.untransform('base64')

and all would produce a byte string. That byte string would be in base64 for the first one, and a decoded binary string for the second two.
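The transform/untransform methods are only a proposal, so the following is a rough sketch of the three cases as free functions rather than anything that exists on bytes or str. The function names transform and untransform are hypothetical; the sketch leans on codecs.encode and codecs.decode, which in current Python 3 accept 'base64' as a bytes-to-bytes codec.

```python
import codecs


def transform(data, name):
    """Hypothetical stand-in for bytestring.transform(name): bytes in, bytes out."""
    if not isinstance(data, bytes):
        raise TypeError("transform() requires bytes input")
    return codecs.encode(data, name)


def untransform(data, name):
    """Hypothetical stand-in for untransform(name): bytes or ASCII str in, bytes out."""
    if isinstance(data, str):
        # An ASCII-only string is converted to bytes before decoding.
        data = data.encode("ascii")
    return codecs.decode(data, name)


encoded = transform(b"hello", "base64")                       # base64 bytes
decoded_from_bytes = untransform(encoded, "base64")           # original bytes
decoded_from_text = untransform(encoded.decode("ascii"), "base64")
```

Dropping the isinstance(data, str) branch from untransform gives the stricter bytes-only variant, where the caller has to write the .encode('ascii') step explicitly.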

Given our existing API, I don't think we want

string.encode('base64')

to work (taking an ascii-only unicode string and returning bytes), and we've already agreed that adding a 'decode' method to string is not going to happen.

We could, however, and quite possibly should, disallow

string.untransform('base64')

even though the underlying module supports it. Thus we would only have bytes-to-bytes transformations for 'base64' and its siblings, and you would write the unicode-ascii-to-bytes transformation as:

string.encode('ascii').untransform('base64')

which has some pedagogical value :).

If you do transform('base64') on a bytestring already encoded as base64 you get a double encoding, yes. I don't see that it is our responsibility to try to protect you from this mistake. The module functions certainly don't.
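A short sketch of that double encoding with the existing module function (variable names are illustrative):

```python
import base64

payload = b"hello"

once = base64.b64encode(payload)
# Base64 output is itself valid bytes input, so encoding it again
# raises no error; it just produces a second layer of encoding.
twice = base64.b64encode(once)

# A single decode only peels off the outer layer.
decoded_once = base64.b64decode(twice)
decoded_twice = base64.b64decode(decoded_once)
```

Nothing in the types distinguishes once from twice, which is why the mistake can only be caught by the programmer and not by the API.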

Given that, is there anything ambiguous about the proposed API?

(Note: if you would like to argue that, eg, base64.b64encode or binascii.b2a_base64 should return a string, it is too late for that argument for backward compatibility reasons.)

It is certainly true that there are many unambiguous cases. In the case of a true text processing facility (eg, Emacs buffers or Python 3 str) where there is an unambiguous text type with a constant and opaque internal representation, it makes a lot of sense to treat the text type as special/central, and use the terminology "encode [from text]" and "decode [to text]". It's easy to remember, which one is special is obvious, and the difference in input and output types means that mistaken use of the API will be detected by duck-typing.

However, in the case of bytes-bytes or text-text transformations, it's not the presence of unambiguous cases that should drive API design IMO. It's the presence of the ambiguous cases that we should cater to. I don't see easy solutions to this issue.

When I asked about ambiguous cases, I was asking for cases where the meaning of "transform('somecodec')" was ambiguous. Sure, it is possible to feed the wrong input into that transformation, but I consider that a programming error, not an ambiguity in the API. After all, you have exactly the same problem if you use the module functions directly, which is currently the only option.

--David


