[Python-Dev] bytes.from_hex()
Bengt Richter bokr at oz.net
Sat Feb 18 08:24:31 CET 2006
- Previous message: [Python-Dev] bytes.from_hex()
- Next message: [Python-Dev] bytes.from_hex()
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Fri, 17 Feb 2006 20:33:16 -0800, Josiah Carlson <jcarlson at uci.edu> wrote:
> Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> > Stephen J. Turnbull wrote:
> > > >>>>> "Guido" == Guido van Rossum <guido at python.org> writes:
> > >
> > > Guido> - b = bytes(t, enc); t = text(b, enc)
> > >
> > > +1 The coding conversion operation has always felt like a constructor
> > > to me, and in this particular usage that's exactly what it is. I
> > > prefer the nomenclature to reflect that.
> >
> > This also has the advantage that it completely avoids using the verbs
> > "encode" and "decode" and the attendant confusion about which
> > direction they go in. e.g.
> >
> >     s = text(b, "base64")
> >
> > makes it obvious that you're going from the binary side to the text
> > side of the base64 conversion.
>
> But you aren't always getting unicode text from the decoding of bytes,
> and you may be encoding bytes to bytes:
>
>     b2 = bytes(b, "base64")
>     b3 = bytes(b2, "base64")
>
> Which direction are we going again?

Well, base64 is probably not your best example, because it necessarily involves characters ;-)
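For readers following along: the proposed constructor-style spelling can be sketched in current Python as thin wrappers over encode/decode. These are illustrative, hypothetical functions (bytes_ and text are stand-in names, not real builtins):

```python
# Hypothetical sketch of the proposed constructor-style API.
# These wrappers are illustrative only; the proposal is about
# spelling, not new machinery.

def bytes_(t, enc):
    """Construct bytes from text t using encoding enc."""
    return t.encode(enc)

def text(b, enc):
    """Construct text from bytes b using encoding enc."""
    return b.decode(enc)

# Round trip: text -> bytes -> text
round_trip = text(bytes_("héllo", "utf-8"), "utf-8")
assert round_trip == "héllo"
```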
If you are using "base64" you are looking at characters in your input to produce your bytes output. The only way you can see characters in bytes input is to decode them. So you are hiding your assumption about b's encoding.
You can make useful rules of inference from type(b), but with bytes you really don't know. "base64" has to interpret b bytes as characters, because that's what it needs to recognize base64 characters, to produce the output bytes.
The characters in b could be encoded in plain ascii or in utf16le; you have to know which. So for utf16le it should be
    b2 = bytes(text(b, 'utf16le'), "base64")
just because you assume an implicit
    b2 = bytes(text(b, 'ascii'), "base64")
doesn't make it so in general. Even if you build that assumption in, it's not really true that you are going "bytes to bytes" without characters involved when you do bytes(b, "base64"). You have just left undocumented an API restriction (assert ) and an implementation optimization ;-)
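A quick illustration with today's stdlib base64 module: the same base64 characters stored as ascii bytes versus utf16le bytes are quite different byte strings, so the character encoding has to be decoded away before the base64 alphabet is even visible (a sketch, not part of the original discussion):

```python
import base64

chars = "aGVsbG8="                    # base64 characters for b"hello"
b_ascii = chars.encode("ascii")       # 8 bytes
b_utf16 = chars.encode("utf-16-le")   # 16 bytes: every other byte is NUL

# The ascii-encoded form is what base64 tooling expects:
assert base64.b64decode(b_ascii) == b"hello"

# The utf16le form must first be decoded to characters, then
# re-encoded as ascii, before the base64 alphabet can be recognized:
recovered = base64.b64decode(b_utf16.decode("utf-16-le").encode("ascii"))
assert recovered == b"hello"
```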
This is the trouble with str.encode and unicode.decode. They both hide an implicit decode or encode respectively. They should be banned IMO. Let people spell it out and maybe understand what they are doing.

OTOH, a bytes-to-bytes codec might be decompressing tgz into tar. For conceptual consistency, one might define a 'bytes' encoding that conceptually turns bytes into unicode byte characters and vice versa. Then "gunzip" can decode bytes, producing unicode characters which are then encoded back to bytes from the unicode ;-) The 'bytes' encoding would numerically be just like latin-1, except on the unicode side it would have a wrapped-bytes internal representation.
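For what it's worth, bytes-to-bytes transforms can be spelled with the direction in the function name rather than in a method, which is how Python 3's codecs module eventually exposed them (shown here with the zlib and base64 codecs; a sketch in modern Python, not the 2006 API under discussion):

```python
import codecs

data = b"sample payload"

# bytes -> bytes compression; the direction is named by the function:
packed = codecs.encode(data, "zlib_codec")
assert codecs.decode(packed, "zlib_codec") == data

# base64 as a bytes -> bytes transform; the output happens to be
# ASCII-only bytes, but no str object is involved at any point:
b64 = codecs.encode(data, "base64_codec")
assert codecs.decode(b64, "base64_codec") == data
assert isinstance(b64, bytes)
```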
    b_tar = bytes(text(b_tgz, 'gunzip'), 'bytes')
Of course, text(b_tgz, 'gunzip') would produce unicode text with a special internal representation that just wraps the bytes, even though they are true unicode. The 'bytes' codec's encode would then just unwrap the internal bytes representation, but conceptually it would be an encoding into bytes. bytes(t, 'latin-1') would produce the same output from the wrapped-bytes unicode.
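The numeric identity with latin-1 is easy to check: latin-1 maps byte values 0-255 one-to-one onto code points 0-255, so it already behaves like the hypothetical 'bytes' wrapper, minus the special internal representation (a sketch verifying the claim above):

```python
raw = bytes(range(256))            # every possible byte value

wrapped = raw.decode("latin-1")    # "wrap" bytes as unicode characters
# Each character's code point equals the original byte value:
assert [ord(c) for c in wrapped] == list(raw)

# "Unwrapping" back through latin-1 is lossless:
assert wrapped.encode("latin-1") == raw
```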
Sometimes conceptual purity can clarify things and sometimes it's just another confusing description.
Regards, Bengt Richter