[Python-Dev] Go \x yourself

Tim Peters tim_one@email.msn.com
Thu, 3 Aug 2000 04:05:31 -0400

Previous message: [Python-Dev] Re: Bookstand at LA Python conference
Next message: [Python-Dev] Go \x yourself

Offline, Guido and /F and I had a mighty battle about the meaning of \x escapes in Python. In the end we agreed to change the meaning of \x in a backward-incompatible way. Here's the scoop:

In 1.5.2 and before, the Reference Manual implies that an \x escape takes two or more hex digits following, and has the value of the last byte. In reality it also accepted just one hex digit, or even none:

"\x123465" # same as "\x65" 'e' "\x65" 'e' "\x1" '\001' "\x\x" '\x\x'

I found no instances of the 0- or 1-digit forms in the CVS tree or in any of the Python packages on my laptop. Do you have any in your code?

And, apart from some deliberate abuse in the test suite, I found no instances of more-than-two-hex-digits \x escapes either. Similarly, do you have any? As Guido said and all agreed, it's probably a bug if you do.

The new rule is the same as Perl uses for \x escapes in -w mode, except that Python will raise ValueError at compile-time for an invalid \x escape: an \x escape is of the form

\xhh

where h is a hex digit. That's it. Guido reports that the O'Reilly books (probably due to their Perl editing heritage!) already say Python works this way. It's the same rule for 8-bit and Unicode strings (in Perl too, at least wrt the syntax). In a Unicode string \xij has the same meaning as \u00ij, i.e. it's the obvious Latin-1 character. Playing back the above pretending the new rule is in place:

"\x123465" # \x12 -> \022, "3456" left alone '\0223456' "\x65" 'e' "\x1" ValueError "\x\x" ValueError

We all support this: the open-ended gobbling \x used to do lost information without warning, and had no benefit whatsoever. While there was some attraction to generalizing \x in Unicode strings, \u1234 is already perfectly adequate for specifying Unicode characters in hex form, and the new rule for \x at least makes consistent Unicode sense now (and in a way JPython should be able to adopt easily too). The new rule gets rid of the unPythonic TMTOWTDI introduced by generalizing Unicode \x to "the last 4 bytes". That generalization also didn't make sense in light of the desire to add \U12345678 escapes too (i.e., so then how many trailing hex digits should a generalized \x suck up? 2? 4? 8?). The only actual use for \x in 8-bit strings (i.e., a way to specify a byte in hex) is still supported with the same meaning as in 1.5.2, and \x in a Unicode string means something as close to that as is possible.
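The Latin-1 reading of \x in Unicode strings and the two-digit limit can be checked directly in a current interpreter. This assumes a modern CPython 3, which is not what this message was written against: there an invalid \x escape is reported as SyntaxError at compile time, so the hypothetical helper compiles below catches both SyntaxError and ValueError.

```python
def compiles(source):
    # Hypothetical helper: True if `source` compiles as an expression.
    try:
        compile(source, "<example>", "eval")
    except (SyntaxError, ValueError):
        return False
    return True


# \xij in a text string is the same character as \u00ij (Latin-1).
same_as_u_escape = "\xe9" == "\u00e9"
latin1_round_trip = b"\xe9".decode("latin-1") == "\xe9"

# Only two hex digits are consumed; the rest are ordinary characters.
leftover = "\x123465"[1:]

print(same_as_u_escape, latin1_round_trip, leftover)
print(compiles(r'"\x65"'), compiles(r'"\x1"'), compiles(r'"\x\x"'))
```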

Sure feels right to me. Gripe quick if it doesn't to you.

as-simple-as-possible-is-a-nice-place-to-rest-ly y'rs - tim
