[Python-Dev] Why can't I encode/decode base64 without importing a module?

MRAB python at mrabarnett.plus.com
Thu Apr 25 18:53:53 CEST 2013


On 25/04/2013 15:22, MRAB wrote:

On 25/04/2013 14:34, Lennart Regebro wrote:

On Thu, Apr 25, 2013 at 2:57 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:

I can think of many use cases where I want to embed base64-encoded data in a larger text before encoding that text and transmitting it over an 8-bit channel.

That still doesn't mean that this should be the default behavior. Just because you can represent base64 as Unicode text doesn't mean that it should be. [snip]

One use case where you clearly do want the base64 encoded data to be unicode strings is because you want to embed it in a text discussing base64 strings, for a blog or a book or something. That doesn't seem to be a very common use case. For the most part you base64 encode things because it's going to be transmitted, and hence the natural result of a base64 encoding should be data that is ready to be transmitted, hence byte strings, and not Unicode strings.

Python 3 doesn't view text as unicode, it represents it as unicode.

I don't agree that there is a significant difference between those wordings in this context. The end result is the same: Things intended to be handled/seen as textual should be unicode strings, things intended for data exchange should be byte strings. Something that is base64 encoded is primarily intended for data exchange. A base64 encoding should therefore return byte strings, especially since most APIs that perform this transmission will take byte strings as input. If you want to include this in textual data, for whatever reason, like printing it in a book, then the conversion is trivial, but that is clearly the less common use case, and should therefore not be the default behavior.

base64 is a way of encoding binary data as text. The problem is that traditionally text has been encoded with one byte per character, except in those locales where there were too many characters in the character set for that to be possible. In Python 3 we're trying to stop mixing binary data (bytestrings) with text (Unicode strings).

RFC 4648 says """Base encoding of data is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII [1] data.""".

To me, "US-ASCII" is an encoding, so it appears to be talking about encoding binary data (bytestrings) to ASCII-encoded text (bytestrings).
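
For reference, the behaviour under discussion can be sketched as follows. This is only an illustration, assuming Python 3.3 or later and the standard-library base64 module; the variable names are made up for the example and are not taken from the thread.

```python
import base64

# Arbitrary binary data (illustrative values only).
raw = bytes([0, 255, 16, 32])

# In Python 3, b64encode takes bytes and returns bytes,
# i.e. ASCII-encoded text held in a bytestring.
encoded = base64.b64encode(raw)

# Getting a Unicode string is an explicit, one-step decode.
as_text = encoded.decode("ascii")

# Decoding works from the bytestring...
round_trip_from_bytes = base64.b64decode(encoded)

# ...and, from Python 3.3 on, also from an ASCII-only str.
round_trip_from_text = base64.b64decode(as_text)
```

The result of b64encode is a bytestring containing only ASCII characters, which fits the reading of RFC 4648 above; the extra decode("ascii") is the "trivial conversion" for anyone who wants to embed the result in text.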


