[Python-Dev] bytes.from_hex()

Greg Ewing greg.ewing at canterbury.ac.nz
Thu Mar 2 06:16:52 CET 2006


Ron Adam wrote:

> 1. We can specify the operation and not be sure of the resulting type.
>
>    or
>
> 2. We can specify the type and not always be sure of the operation.
>
> maybe there's a way to specify both so it's unambiguous?

Here's another take on the matter. When we're doing Unicode encoding or decoding, we're performing a type conversion. The natural way to write a type conversion in Python is with a constructor. But we can't just say

u = unicode(b)

because that doesn't give enough information. We want to say that b is really of a type such as "bytes containing utf8-encoded text":

u = unicode(b, 'utf8')

Here we're not thinking of the 'utf8' as selecting an encoder or decoder, but as giving extra information about the type of b that isn't carried by b itself.
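
For readers trying this on Python 3, where str plays the role that unicode plays in this message, the constructor form above runs essentially as written. This is only an illustration of the reading "the second argument describes what b holds"; the sample value is made up.

```python
# b holds bytes that we know (from outside information) to be utf8-encoded text.
b = b"caf\xc3\xa9"

# Without the extra information, str() does not decode anything;
# it just returns the printable representation of the bytes object.
ambiguous = str(b)

# With the extra information, the constructor performs the type conversion.
u = str(b, "utf8")
print(u)
```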

Now, going in the other direction, we might think to write

b = bytes(u, 'utf8')

But that wouldn't be right, because if we interpret this consistently it would mean we're saying that u contains utf8-encoded information, which is nonsense. What we need is a way of saying "construct me something of type 'bytes containing utf8-encoded text'":

b = bytes['utf8'](u)

Here I've coined the notation t[enc], which evaluates to a callable object that constructs an object of type t by encoding its argument according to enc.
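
The t[enc] notation is not something the built-in types support, but a rough sketch of it can be written in Python 3 using __class_getitem__. The class name Bytes below is hypothetical, a stand-in for the proposed bytes type, and the sketch only covers text-to-bytes encodings such as utf8.

```python
class Bytes(bytes):
    """Hypothetical stand-in for a bytes type that supports Bytes[enc](text)."""

    def __class_getitem__(cls, enc):
        # Bytes[enc] evaluates to a callable that constructs a Bytes object
        # by encoding its argument according to enc.
        def construct(text):
            return cls(text.encode(enc))
        return construct


u = "caf\u00e9"
b = Bytes["utf8"](u)      # "construct me bytes containing utf8-encoded text"
round_trip = str(b, "utf8")  # going back uses the plain constructor form
```

Subscripting the class rather than passing a second argument keeps the two directions visually distinct: the encoding name in brackets describes the result, while the encoding name as an argument describes the input.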

Now let's consider base64. Here, the roles of bytes and unicode are reversed, because the bytes are just bytes without any further interpretation, whereas the unicode is really "unicode containing base64-encoded data". So we write

u = unicode['base64'](b) # encoding

b = bytes(u, 'base64') # decoding

Note that this scheme is reasonably idiot-proof, e.g.

u = unicode(b, 'base64')

results in a type error, because this specifies a decoding operation, and the base64 decoder takes text as input, not bytes.
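
A minimal sketch of the base64 case in Python 3 follows. The function names encode_base64 and decode_base64 are hypothetical; they stand in for the encoding form (the t[enc] notation applied to b) and for the decoder behind bytes(u, 'base64'). The explicit isinstance checks are an assumption of this sketch, added to model the type error described above; the standard base64.b64decode on its own also accepts ASCII bytes.

```python
import base64


def encode_base64(b):
    # Encoding direction: plain bytes in, "text containing base64 data" out.
    if not isinstance(b, bytes):
        raise TypeError("the base64 encoder takes bytes as input, not text")
    return base64.b64encode(b).decode("ascii")


def decode_base64(u):
    # Decoding direction: base64 text in, plain bytes out.
    if not isinstance(u, str):
        raise TypeError("the base64 decoder takes text as input, not bytes")
    return base64.b64decode(u)


b = b"hello"
u = encode_base64(b)          # encoding
decoded = decode_base64(u)    # decoding

# Asking for a decoding of something that is already plain bytes is rejected.
try:
    decode_base64(b)
    error = None
except TypeError as exc:
    error = str(exc)
```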

What happens with transformations where the input and output types are the same? In this scheme, they're not really the same any more, because we're providing extra type information. Suppose we had a code called 'piglatin' which goes from unicode to unicode. The types involved are really "text" and "piglatin-encoded text", so we write

u2 = unicode['piglatin'](u1) # encoding

u1 = unicode(u2, 'piglatin') # decoding

Here you won't get any type error if you get things backwards, but there's not much that can be done about that. You just have to keep straight which of your strings contain piglatin and which don't.
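
To make the same-type case concrete, here is a toy sketch in Python 3. The piglatin rule used (move the first letter to the end and append "ay") and the function names are invented for illustration only; no such codec exists in the standard library.

```python
def to_piglatin(word):
    # Encoding direction: text in, "piglatin-encoded text" out.
    return word[1:] + word[0] + "ay"


def from_piglatin(word):
    # Decoding direction: drop the trailing "ay" and move the letter
    # before it back to the front.
    return word[-3] + word[:-3]


u1 = "python"
u2 = to_piglatin(u1)              # encoding
restored = from_piglatin(u2)      # decoding

# Getting things backwards: decoding text that was never encoded.
# Both values are ordinary strings, so nothing complains.
backwards = from_piglatin(u1)
```

Here backwards is simply a scrambled string; since both sides are plain text, no type check can catch the mistake.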

Is this scheme any better than having encode and decode methods/functions? I'm not sure, but it shows that a suitably enhanced notion of "data type" can be used to replace the notions of encoding and decoding and maybe reduce potential confusion about which direction is which.

--
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | Carpe post meridiam!                 |
Christchurch, New Zealand          | (I'm not a morning person.)          |
greg.ewing at canterbury.ac.nz     +--------------------------------------+


