[Python-Dev] bytes.from_hex()
Greg Ewing greg.ewing at canterbury.ac.nz
Wed Feb 22 12:35:39 CET 2006
- Previous message: [Python-Dev] bytes.from_hex()
- Next message: [Python-Dev] bytes.from_hex()
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Stephen J. Turnbull wrote:
> Base64 is a (family of) wire protocol(s). It's not clear to me that it makes sense to say that the alphabets used by "baseNN" encodings are composed of characters,
Take a look at
http://en.wikipedia.org/wiki/Base64
where it says
...base64 is a binary to text encoding scheme whereby an arbitrary sequence of bytes is converted to a sequence of printable ASCII characters.
Also see RFC 2045 (http://www.ietf.org/rfc/rfc2045.txt) which defines base64 in terms of an encoding from octets to characters, and also says
A 65-character subset of US-ASCII is used ... This subset has the important property that it is represented identically in all versions of ISO 646 ... and all characters in the subset are also represented identically in all versions of EBCDIC.
Which seems to make it perfectly clear that the result of the encoding is to be considered characters, which are not necessarily going to be encoded using ascii.
So base64 on its own is not a wire protocol. Only after encoding the characters do you have a wire protocol.
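The two-step view Greg describes can be sketched with today's stdlib (which, for what it's worth, settled on a bytes-to-bytes `base64` API, so the character view has to be recovered with an explicit decode):

```python
import base64

payload = b"\x00\xff\x10"
b64_bytes = base64.b64encode(payload)   # b'AP8Q' -- bytes-to-bytes in today's stdlib
b64_text = b64_bytes.decode("ascii")    # the character view RFC 2045 describes
wire = b64_text.encode("utf-16")        # only now do we have a wire protocol
```

The point is that the middle value is logically text: the final `encode` step could use any unicode coding, not just an ascii superset.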
> I don't see any case for "correctness" here, only for convenience,
I'm thinking of convenience, too. Keep in mind that in Py3k, 'unicode' will be called 'str' (or something equally neutral like 'text') and you will rarely have to deal explicitly with unicode codings, this being done mostly for you by the I/O objects. So most of the time, using base64 will be just as convenient as it is today: base64_encode(my_bytes) and write the result out somewhere.
The reason I say it's correct is that if you go straight from bytes to bytes, you're assuming the eventual encoding is going to be an ascii superset. The programmer is going to have to know about this assumption and understand all its consequences and decide whether it's right, and if not, do something to change it.
Whereas if the result is text, the right thing happens automatically whatever the ultimate encoding turns out to be. You can take the text from your base64 encoding, combine it with other text from any other source to form a complete mail message or xml document or whatever, and write it out through a file object that's using any unicode encoding at all, and the result will be correct.
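A sketch of that workflow, using the modern stdlib (the `io.StringIO` stands in for a file opened with any text encoding, e.g. `open(path, "w", encoding="utf-16")`; the payload bytes are illustrative):

```python
import base64
import io

# The base64 result is treated as ordinary text, so it combines freely
# with other text; the unicode coding happens once, at the I/O boundary.
attachment = base64.b64encode(b"\x89PNG binary payload").decode("ascii")
message = "Content-Transfer-Encoding: base64\n\n" + attachment + "\n"

out = io.StringIO()   # stands in for a file object with any encoding
out.write(message)
```

Whatever encoding the file object uses, the base64 characters come out correctly, because they were never tied to a particular byte representation.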
> it's also efficient to use bytes<->bytes for XML, since conversion of base64 bytes to UTF-8 characters is simply a matter of "Simon says, be UTF-8!"
Efficiency is an implementation concern. In Py3k, strings which contain only ascii or latin-1 might be stored as 1 byte per character, in which case this would not be an issue.
> And in the classroom, you're just going to confuse students by telling them that UTF-8 --[Unicode codec]--> Python string is decoding but UTF-8 --[base64 codec]--> Python string is encoding, when MAL is telling them that --> Python string is always decoding.
Which is why I think that only unicode codings should be available through the .encode and .decode interface. Or alternatively there should be something more explicit like .unicode_encode and .unicode_decode that is thus restricted.
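A hypothetical sketch of that restricted interface (the names `unicode_encode`/`unicode_decode` and the allowed-coding set are illustrative only; no such API was ever adopted):

```python
# Illustrative only: restrict str<->bytes conversion to genuine unicode
# codings, so base64 and friends can't masquerade as text codecs.
_UNICODE_CODINGS = {"ascii", "latin-1", "utf-8", "utf-16", "utf-32"}

def unicode_encode(text: str, coding: str) -> bytes:
    """str -> bytes, allowed only for genuine unicode codings."""
    if coding not in _UNICODE_CODINGS:
        raise LookupError(f"{coding!r} is not a unicode coding")
    return text.encode(coding)

def unicode_decode(data: bytes, coding: str) -> str:
    """bytes -> str, allowed only for genuine unicode codings."""
    if coding not in _UNICODE_CODINGS:
        raise LookupError(f"{coding!r} is not a unicode coding")
    return data.decode(coding)
```

Under such an interface, `unicode_encode(s, "base64")` simply fails, which keeps the encode/decode terminology unambiguous: str-to-bytes is always encoding, bytes-to-str always decoding.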
Also, if most unicode coding is done in the I/O objects, there will be far less need for programmers to do explicit unicode coding in the first place, so likely it will become more of an advanced topic, rather than something you need to come to grips with on day one of using unicode, like it is now.
-- Greg