[Python-Dev] PEP 461 updates

Stephen J. Turnbull stephen at xemacs.org
Fri Jan 17 10:59:30 CET 2014


Steven D'Aprano writes:

On Fri, Jan 17, 2014 at 11:19:44AM +0900, Stephen J. Turnbull wrote:

"ASCII compatible" is a technical term in encodings, which means "bytes in the range 0-127 always have ASCII coded character semantics, do what you like with bytes in the range 128-255."[1]

Examples, and counter-examples, may help. Let me see if I have got this right: an ASCII-compatible encoding may be an ASCII-superset like Latin-1, or a variable-width encoding like UTF-8 where the ASCII chars are encoded to the same bytes as ASCII, and non-ASCII chars are not. A counter-example would be UTF-16, or some of the Asian encodings like Big5. Am I right so far?

All correct.
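
To make the examples and counter-examples above concrete, here is a small sketch. The helper names `is_ascii_preserved` and `has_ascii_range_bytes` are invented for illustration and are not part of any proposal; the codec names are the standard Python ones.

```python
def is_ascii_preserved(codec, text="PEP 461"):
    """True if ASCII text encodes to exactly the same bytes as in ASCII."""
    return text.encode(codec) == text.encode("ascii")


def has_ascii_range_bytes(codec, char):
    """True if a non-ASCII character yields at least one byte in 0-127."""
    return any(byte < 0x80 for byte in char.encode(codec))


# ASCII text survives unchanged in Latin-1 and UTF-8, but not in UTF-16.
for codec in ("latin-1", "utf-8", "utf-16"):
    print(codec, "keeps ASCII text unchanged:", is_ascii_preserved(codec))

# In Latin-1 and UTF-8 a non-ASCII character never produces a byte in 0-127.
# In UTF-16 and Big5 it can, so such a byte cannot be assumed to be ASCII.
for codec, char in (("latin-1", "\u00e9"), ("utf-8", "\u00e9"),
                    ("utf-16-le", "\u00e9"), ("big5", "\u529f")):
    print(codec, "emits bytes < 128 for non-ASCII:",
          has_ascii_range_bytes(codec, char))
```

Note that Big5 passes the first check for plain ASCII text; it fails to be ASCII compatible because the second byte of a two-byte character can land in the 0-127 range.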

But Nick isn't talking about an encoding, he's talking about a data format. I think that an ASCII-compatible format means one where (in at least some parts of the data) bytes between 0 and 127 have the same meaning as in ASCII, e.g. byte 84 is to be interpreted as ASCII character "T". This doesn't mean that every byte 84 means "T", only that some of them do -- hopefully well-defined sections of the data. Below, you introduce the term "ASCII segments" for these.

Yes, except that I believe Nick, as well as the "file-and-wire guys", strengthen "hopefully well-defined" to just "well-defined".
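
A rough illustration of a format with well-defined ASCII segments, assuming a made-up wire message with ASCII header lines, a blank line, and then an opaque binary payload. The message layout and field names are hypothetical and only loosely modelled on HTTP-style headers.

```python
# Hypothetical message: ASCII header segment, blank line, binary payload.
message = b"Content-Type: IMAGE/PNG\r\nContent-Length: 4\r\n\r\n\x89PNG"

# The format itself says where the ASCII segment ends.
header, _, payload = message.partition(b"\r\n\r\n")

# Inside the header, ASCII-based bytes methods are meaningful.
fields = {}
for line in header.split(b"\r\n"):
    name, _, value = line.partition(b":")
    fields[name.strip().lower()] = value.strip()

# The payload is not an ASCII segment, so it is passed through untouched.
declared_length = int(fields[b"content-length"].decode("ascii"))
print(fields[b"content-type"], declared_length, payload)
```

The byte 0x50 in the payload happens to look like "P", but nothing in this format says it has to be one, which is why only the header gets `lower()` and `strip()` applied.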

<specified bytes methods> are designed for use *only* on bytes
that are ASCII segments; use on other data is likely to cause
hard-to-diagnose corruption.

An example: if you have the byte b'\x63', calling upper() on that will return b'\x43'. That is only meaningful if the byte is intended as the ASCII character "c".

Good example.
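
The same example as a runnable sketch, together with one way the corruption mentioned above might show up. The 16-bit integer is an invented case: it just happens to contain the same byte values as lowercase ASCII letters.

```python
import struct

# Meaningful use: the byte really is the ASCII character "c".
letter = b"\x63"
print(letter.upper())            # b'C', i.e. b'\x43'

# Not meaningful: the same byte values, but as a little-endian 16-bit integer.
original_value = 0x6163
raw = struct.pack("<H", original_value)   # bytes 0x63 0x61

# upper() has no idea this is a number and rewrites both bytes.
damaged = raw.upper()
damaged_value = struct.unpack("<H", damaged)[0]

print(hex(original_value), "->", hex(damaged_value))
```

No exception is raised in the second case; the integer silently changes, which is what makes this kind of mistake hard to diagnose.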


