[Python-3000] base64 - bytes and strings

Talin talin at acm.org
Mon Jul 30 03:21:13 CEST 2007


Guido van Rossum wrote:

On 7/29/07, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:

Martin v. Löwis wrote:

The point that proponents of "base64 encoding should yield strings" miss is that US-ASCII is both a character set and an encoding.

Last time we discussed this, I went and looked at the RFC where base64 is defined. According to my reading of it, nowhere does it say that base64 output must be encoded as US-ASCII, nor any other particular encoding.

It does say that the characters used were chosen because they are present in a number of different character sets in use at the time, and explicitly mentions EBCDIC as one of those character sets. To me this quite clearly says that base64 is defined at the level of characters, not encodings.

I think it's all beside the point. We should look at the use cases. I recall finding out once that a Java base64 implementation was much slower than Python's -- it turned out that the Java version was converting everything to Strings, which then had to be converted back to bytes in order to output them. My suspicion is that in the end using bytes is more efficient and more convenient; it might take some looking through the email package to confirm or refute this. (The email package hasn't been converted to work in the struni branch; that should happen first. Whoever does that might well be the one who tells us how they want their base64 APIs.) An alternative might be to provide both string- and bytes-based APIs, although that doesn't help with deciding what the default one (the one that uses the same names as 2.x) should do.
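(As a concrete illustration of the bytes-based API under discussion -- this is just a sketch assuming a bytes-in/bytes-out b64encode/b64decode pair:)

```python
import base64

# Bytes in, bytes out -- no text encoding involved at this layer.
encoded = base64.b64encode(b"hello world")
decoded = base64.b64decode(encoded)
```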

One has to be careful when comparing performance with Java, because you need to specify whether you are using the "old" API or the "new" one. (It seems that almost everything in Java has an old and new API.)

I just recently did some work in Java with base64 encoding, or more specifically, URL-safe encoding. The library I was working with both consumed and produced arrays of bytes. I think that this is the correct way to do it.
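To show what I mean by URL-safe encoding -- a sketch using the standard-library names, assuming a urlsafe variant that swaps the two characters that are special in URLs:

```python
import base64

# Bytes chosen so the standard base64 alphabet needs '+' and '/'.
raw = bytes([0xfb, 0xef, 0xff])
standard = base64.b64encode(raw)         # uses b'+' and b'/'
urlsafe = base64.urlsafe_b64encode(raw)  # substitutes b'-' and b'_'
```

Note that both variants consume and produce bytes; only the alphabet differs.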

In my specific use case, I was dealing with encrypted bytes, where the encrypter also produced and consumed bytes, so it made sense that the character encoder did the same. But even in the case where no encryption is involved, I think dealing with bytes is right.

I believe that converting a Unicode string to a base64-encoded form is necessarily a 2-step process. Step 1 is to convert from Unicode characters to bytes, using an appropriate character encoding (UTF-8, UTF-16, and so on), and step 2 is to encode those bytes in base64. The resulting encoded byte array is conceptually an ASCII-encoded string, but it's more convenient in most cases to represent it as a byte array than as a string object, since you are usually about to send it over the wire. In other words, it makes sense to think about the conversion as (string -> bytes -> string), while the actual objects being generated are (string -> bytes -> bytes).
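The two steps look like this in code (a sketch; the variable names are mine):

```python
import base64

text = "café"                     # a unicode string
step1 = text.encode("utf-8")      # step 1: characters -> bytes
step2 = base64.b64encode(step1)   # step 2: bytes -> base64 bytes
# The result happens to be pure ASCII, so a str view is one decode away:
as_str = step2.decode("ascii")
```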

The fact that 2 steps are needed is evident from the fact that there are actually two encodings involved, and these two encodings are mostly independent. For example, one could just as easily base64-encode a UTF-16 encoded string as a UTF-8 encoded string. The fact that you can vary one encoding without changing the other would seem to argue that they are distinct and independent.
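To make the independence concrete -- the same characters, run through two different intermediate encodings, produce two different base64 results:

```python
import base64

s = "base64"
b64_of_utf8 = base64.b64encode(s.encode("utf-8"))
b64_of_utf16 = base64.b64encode(s.encode("utf-16-le"))
# Same text, different intermediate bytes, different base64 output;
# both round-trip as long as the matching decode steps are used.
```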

Nor can you collapse this into a single encoding step - you can't go directly from an internal unicode string to base64, since a unicode string is an array of code units ranging from 0 to 0xFFFF, and base64 can't encode a number larger than 255.

Now, you could do both steps in a single function. However, you still have to choose what the intermediate encoding form is, even if you never actually see it. Usually this will be UTF-8.
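Such a combined function might be sketched like this (b64_of_text is a hypothetical name of my own, not an existing API):

```python
import base64

def b64_of_text(s: str, encoding: str = "utf-8") -> bytes:
    # Hypothetical convenience wrapper: both steps in one call, with
    # the intermediate character encoding still an explicit choice.
    return base64.b64encode(s.encode(encoding))
```

The intermediate encoding parameter is still there, even though the caller never sees the intermediate bytes.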

-- Talin


