[Python-Dev] bytes.from_hex()
Josiah Carlson jcarlson at uci.edu
Sat Feb 18 08:05:48 CET 2006
Bob Ippolito <bob at redivi.com> wrote:
> On Feb 17, 2006, at 8:33 PM, Josiah Carlson wrote:
>
> > Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> >>
> >> Stephen J. Turnbull wrote:
> >>>>>>>> "Guido" == Guido van Rossum <guido at python.org> writes:
> >>
> >>>     Guido> - b = bytes(t, enc); t = text(b, enc)
> >>>
> >>> +1  The coding conversion operation has always felt like a
> >>> constructor to me, and in this particular usage that's exactly
> >>> what it is.  I prefer the nomenclature to reflect that.
> >>
> >> This also has the advantage that it completely
> >> avoids using the verbs "encode" and "decode"
> >> and the attendant confusion about which direction
> >> they go in.
> >>
> >> e.g.
> >>
> >>    s = text(b, "base64")
> >>
> >> makes it obvious that you're going from the
> >> binary side to the text side of the base64
> >> conversion.
> >
> > But you aren't always getting unicode text from the decoding of
> > bytes, and you may be encoding bytes to bytes:
> >
> >     b2 = bytes(b, "base64")
> >     b3 = bytes(b2, "base64")
> >
> > Which direction are we going again?
>
> This is exactly why the current set of codecs are INSANE.
> unicode.encode and str.decode should be used only for unicode
> codecs.  Byte transforms are entirely different semantically and
> should be some other method pair.
The problem is that we are overloading data types. Strings (and bytes) can contain encoded text, plain data, or even encoded data. Unless the plan is to make bytes contain only encoded unicode, or only data, or only encoded data, the confusion for users may continue.

Me, I'm a fan of education. Educating your users is simple, and it gets easier if you have good exceptions and documentation. Raise an exception when a user tries to use a codec with a source type it does not accept: once unicode/text becomes the default format for string literals, '...'.decode('utf-8') should raise an error like "Cannot use text as a source for 'utf-8' decoding".
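To make the suggested exception concrete, here is a minimal sketch in present-day Python 3, which postdates this thread. The helper name decode_text is hypothetical, and the TypeError choice is an assumption; only the error wording comes from the message above.

```python
def decode_text(source, codec):
    """Decode bytes to text, refusing a source of the wrong type.

    Hypothetical helper for illustration; not an existing API.
    """
    if isinstance(source, str):
        # Text is already decoded, so tell the user why this cannot work
        # instead of failing with something obscure further down.
        raise TypeError(
            "Cannot use text as a source for %r decoding" % codec
        )
    return source.decode(codec)


# Bytes are an acceptable source for a text codec.
text = decode_text(b"caf\xc3\xa9", "utf-8")

# Text is not, and the message says so.
try:
    decode_text("already text", "utf-8")
except TypeError as exc:
    message = str(exc)
```

The point of the sketch is only that the failure names both the offending source type and the codec, which is what makes the "educate the users" approach workable.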
Tossing out bytes.encode(), as well as decodings for bytes->bytes, also brings up the issue of text.decode() for pure text transformations. Are we going to push all of those transformations somewhere else?
Look at what we've currently got going for data transformations in the standard library to see what these removals will do: base64 module, binascii module, binhex module, uu module, ... Do we want or need to add another top-level module for every future encoding/codec that comes out (or does everyone think that we're done seeing codecs)? Do we want to keep monkey-patching binascii with names like 'a2b_hqx'? While there is currently one text->text transform (rot13), do we add another module for text->text transforms? Would it start having names like t2e_rot13() and e2t_rot13()?
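For illustration, this is roughly what the per-module spelling of these transforms looks like. The sketch assumes present-day Python 3 and uses only base64, binascii, and codecs; binhex and uu are left out because they may not be available in newer releases.

```python
import base64
import binascii
import codecs

data = b"hi"

# The same base64 transform is reachable under two module-specific names.
via_base64 = base64.b64encode(data)       # no trailing newline
via_binascii = binascii.b2a_base64(data)  # appends a trailing newline

# Hex lives in binascii under yet another naming scheme.
hex_form = binascii.hexlify(data)
round_trip = binascii.unhexlify(hex_form)

# rot13 is a text -> text transform, reached through the codec machinery.
rot = codecs.encode("hello", "rot_13")
```

Each transform has its own module, its own function names, and its own conventions (such as the trailing newline), which is the sprawl the questions above are pointing at.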
Educate the users. Raise better exceptions telling people why their encoding or decoding failed, as Ian Bicking already pointed out. If bytes.encode() and the equivalent of text.decode() are going to disappear, Bengt Richter had a good idea with bytes.recode() for strictly bytes transformations (and the equivalent for text), though it is ambiguous as to the direction: are we encoding or decoding with bytes.recode()? In my opinion, this is why .encode() and .decode() make sense to keep on both bytes and text: the direction is unambiguous, and if one has even a remote idea of what the heck the codec is, they know their result.
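As a rough sketch of the direction argument, present-day Python 3 (an assumption; it postdates this thread) still exposes bytes->bytes codecs through codecs.encode() and codecs.decode(), and the verb alone says which way the data is going:

```python
import codecs

b = b"hi"

# bytes -> bytes, toward the base64 representation.
encoded = codecs.encode(b, "base64")

# bytes -> bytes, back toward the original data.
decoded = codecs.decode(encoded, "base64")

# Encoding twice is legal too; the verb still states the direction.
twice = codecs.encode(encoded, "base64")
undone = codecs.decode(codecs.decode(twice, "base64"), "base64")

# The hex codec follows the same pattern.
hexed = codecs.encode(b, "hex")
unhexed = codecs.decode(hexed, "hex")
```

A single recode(b, "base64") call would have to pick one of these two directions by convention, which is the ambiguity raised above.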
- Josiah