[Python-Dev] bytes.from_hex()

Ron Adam rrr at ronadam.com
Fri Feb 24 23:46:00 CET 2006

The following reply is a rather longer explanation than I intended of why codings like 'rot' aren't the same thing as pure unicode codecs (and how they differ), and why they probably should be treated differently. If you already understand that, then I suggest skipping this. But if you like detailed logical analysis, it might be of some interest, even if it's reviewing the obvious to those who already know.

(And hopefully I didn't make any really obvious errors myself.)

Stephen J. Turnbull wrote:

"Ron" == Ron Adam <rrr at ronadam.com> writes:

    Ron> We could call it transform or translate if needed.

You're still losing the directionality, which is my primary objection to "recode". The absence of directionality is precisely why "recode" is used in that sense for i18n work.

I think you're not understanding what I suggested. It might help if we could agree on some points and then go from there.

So, let's consider a "codec" and a "coding" as being two different things, where a codec is a character subset of unicode characters expressed in a native format, and a coding is not a subset of the unicode character set but an operation performed on text. So you would have the following properties.

codec ->  text is always in *one_codec* at any time.

coding ->  operation performed on text.

Let's add a special default coding called 'none' to represent a do-nothing coding. (Figuratively, for explanation purposes.)

'none' -> return the input as is, or the uncoded text

Given the above relationships we have the following possible transformations.

1. codec to like codec: 'ascii' to 'ascii'
2. codec to unlike codec: 'ascii' to 'latin1'

And we have coding relationships of:

a. coding to like coding # Unchanged, do nothing
b. coding to unlike coding

Then we can express all the possible combinations as...

[1.a, 1.b, 2.a, 2.b]


1.a -> coding in codec to like coding in like codec:

    'none' in 'ascii' to 'none' in 'ascii'

1.b -> coding in codec to diff coding in like codec:

    'none' in 'ascii' to 'base64' in 'ascii'

2.a -> coding in codec to same coding in diff codec:

    'none' in 'ascii' to 'none' in 'latin1'

2.b -> coding in codec to diff coding in diff codec:

    'none' in 'latin1' to 'base64' in 'ascii'

This last one is a problem, as some codecs combine coding with character set encoding and return text in a different encoding than they received. The line is also blurred between types and encodings. Is unicode an encoding? Will bytes also be an encoding?

Using the above combinations:

(1.a) is just creating a new copy of an object.

s = str(s)

(1.b) is recoding an object; it returns a copy of the object in the same encoding.

s = s.encode('hex-codec')  # ascii str -> ascii str coded in hex
s = s.decode('hex-codec')  # ascii str coded in hex -> ascii str

These are really two different operations, and encoding repeatedly results in nested codings. Codecs (as a pure subset of unicode) don't have that property.
The hex-codec also fits the 2.b pattern below if the source string is of a different type than ascii (or the default string?).
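
A minimal sketch of the nesting property, assuming Python 3 rather than the Python 2.4 this thread was written against (there the hex transform is reached through `codecs.encode()` and `codecs.decode()` on bytes). It only illustrates that each encode adds a layer and that the same number of decodes is needed to get back.

```python
import codecs

original = b"abc"

# Each pass through the hex coding wraps the previous result in another layer.
once = codecs.encode(original, "hex_codec")    # b'616263'
twice = codecs.encode(once, "hex_codec")       # b'363136323633'

# One decode removes only the outermost layer...
assert codecs.decode(twice, "hex_codec") == once

# ...so two encodes need two decodes to recover the input.
restored = codecs.decode(codecs.decode(twice, "hex_codec"), "hex_codec")
print(once, twice, restored)
```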

(2.a) creates a copy encoded in a new codec.

s = s.encode('latin1')

I believe string constructors should have an encoding argument for use with unicode strings.

s = str(u, 'latin1') # This would match the bytes constructor.

(2.b) are combinations of the above.

s = u.encode('base64') # unicode to ascii string as base64 coded characters

u = unicode(s.decode('base64')) # ascii string coded in base64 to unicode characters

>>> u = unicode(s, 'base64')
Traceback (most recent call last):
  File "", line 1, in ?
TypeError: decoder did not return an unicode object (type=str)

Ooops... ;)
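
For comparison, a small sketch assuming a current Python 3 interpreter (3.4 or later), which is not what this thread was written against. There, `bytes.decode()` only accepts text encodings and rejects 'base64' with a `LookupError`, while the transform itself stays reachable through `codecs.decode()`.

```python
import codecs

coded = b"QSBzdHJpbmcu"   # base64 coding of b"A string."

# bytes.decode() is reserved for text encodings in Python 3,
# so asking it for a base64 "decode" is refused.
try:
    coded.decode("base64")
    rejected = False
except LookupError:
    rejected = True

# The bytes-to-bytes transform is still available via the codecs module.
plain = codecs.decode(coded, "base64")

print(rejected, plain)
```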

So is coding the same as a codec? I think they have different properties and should be treated differently except when the practicality over purity rule is needed. And in those cases maybe the names could clearly state the result.

u.decode('base64ascii')  # name indicates coding to codec

A string. -> QSBzdHJpbmcu -> UVNCemRISnBibWN1

Looks like the underlying sequence is:

  native string -> unicode -> unicode coded base64 -> coded ascii str

And the decode operation would be...

  coded ascii str -> unicode coded base64 -> unicode -> ascii str

Except it may combine some of these steps to speed it up.

Since it's a hybrid codec that includes a coding operation, we have to treat it as a codec.
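
The 'A string.' chain shown above can be checked with the standard `base64` module. This is a sketch assuming Python 3, where the coding step works on bytes, so the codec steps (text to ascii bytes and back) are written out explicitly rather than combined.

```python
import base64

text = "A string."

# codec step: text -> ascii bytes
raw = text.encode("ascii")

# coding step, applied twice: each call nests another base64 layer
first = base64.b64encode(raw)
second = base64.b64encode(first)

# Undo in reverse order: two coding steps, then the codec step.
recovered = base64.b64decode(base64.b64decode(second)).decode("ascii")

print(first.decode("ascii"), second.decode("ascii"), recovered)
```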

    Ron> * Given that the string type gains a codec attribute
    Ron> to handle automatic decoding when needed. (is there a reason
    Ron> not to?)

    Ron> str(object[,codec][,error]) -> string coded with codec
    Ron> unicode(object[,error]) -> unicode
    Ron> bytes(object) -> bytes

str == unicode in Py3k, so this is a non-starter. What do you want to say?

    Ron> * a recode() method is used for transformations that
    Ron> do not change the current codec.

I'm not sure what you mean by the "current codec". If it's attached to an "encoded object", it should be the codec needed to decode the object. And it should be allowed to be a "codec stack".

I wasn't thinking in terms of stacks, but in that case the current codec would be the top of the stack. For the record, I think stackable codecs are a very bad idea.

Back to recode vs encode/decode, the example used above might be useful.

s = s.encode('hex-codec')  # ascii str -> ascii str coded in hex
s = s.decode('hex-codec')  # ascii str coded in hex -> ascii str

In my opinion these are actually two very different (although related) operations that would be better expressed with different names.

Currently it's a hybrid codec that converts its input to an ascii string (or default encoding?), but when decoding you end up with an ascii encoding even if you started with something else. So the decode isn't a true inverse of encode in some cases.

As a coding operation it would be.

u = u.recode('to_hex')
u = u.recode('from_hex')

Where this would work with both unicode and strings without changing the codec.
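
A rough sketch of how such a type-preserving operation might behave, assuming Python 3. The `recode` function, its `codec` parameter, and the latin-1 default are all hypothetical choices made for illustration; nothing like this exists in the standard library.

```python
import binascii

def recode(data, coding, codec="latin-1"):
    """Hypothetical helper: apply a coding and return the same type it was given.

    `codec` is only used to move text to bytes and back around the transform;
    its default here is an arbitrary choice for the sketch.
    """
    transforms = {"to_hex": binascii.hexlify, "from_hex": binascii.unhexlify}
    transform = transforms[coding]
    if isinstance(data, str):
        # text in, text out: the caller never sees a type change
        return transform(data.encode(codec)).decode(codec)
    # bytes in, bytes out
    return transform(data)

u = "abc"
coded = recode(u, "to_hex")        # still a str
again = recode(coded, "to_hex")    # doing it again recodes the coded text
back = recode(recode(again, "from_hex"), "from_hex")
print(coded, again, back)
```

Because the input type is returned unchanged, 'from_hex' is a true inverse of 'to_hex' here, which is the property the encode/decode pair lacks.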

It also keeps the 'if I do it again, it will recode the coded text' relationship, so I think the name is appropriate, IMHO.

Pure codecs such as latin-1 can be invoked over and over, and you can always get back what you put in in a single step.

>>> s = 'abc'
>>> for n in range(100):
...     s = s.encode('latin-1')
...
>>> print s, type(s)
abc <type 'str'>

Supposedly a lot of these issues will go away in Python 3000. And we can probably live with the current state of things. But even after Python 3000 it seems to me we will still need access to codecs as we may run across encoded text input from various sources.

Cheers, Ron
