[Python-Dev] PEP 263 -- Python Source Code Encoding
Tim Peters tim.one@comcast.net
Wed, 27 Feb 2002 03:25:02 -0500
[M.-A. Lemburg]
> Jack had the same question. The simple answer is: we need this in order to maintain backward compatibility when we move to phase two of the implementation.
> Here's the longer one: ASCII is the standard encoding for Python keywords and identifiers. There is no standard source code encoding for string literals.
But there is:
    Python uses the 7-bit ASCII character set for program text and
    string literals. 8-bit characters may be used in string literals
    and comments but their interpretation is platform dependent; the
    proper way to insert 8-bit characters in string literals is by
    using octal or hexadecimal escape sequences.
The Ref Man has said "7-bit ASCII" for both "program text and string literals" for a long time. The formal grammar in the Ref Man agrees with this (including the formal grammar for Unicode literals). It's an historical accident that the tokenizer happened to use C isalpha() to "enforce" this for identifiers, and that C isalpha() happened to grow locale-dependence while Guido was too drunk with power to notice.
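A minimal sketch of the escape-sequence advice in the Ref Man passage quoted above. It is written in Python 3 syntax, which postdates this thread, so the bytes literals are an assumption standing in for the 8-bit strings under discussion, and the variable names are illustrative only.

```python
# The source text of both literals is pure 7-bit ASCII; the escapes
# name the 8-bit value without putting a non-ASCII character in the file.
hex_form = b"caf\xe9"      # hexadecimal escape
octal_form = b"caf\351"    # octal escape for the same byte value
same_bytes = hex_form == octal_form

# Which character the byte 0xE9 stands for still depends on the
# encoding applied to it, which is the "platform dependent" part.
as_latin1 = hex_form.decode("latin-1")
as_cp437 = hex_form.decode("cp437")
differs = as_latin1 != as_cp437
```

The escapes pin down the byte value on every platform; only the mapping from that byte to a character varies.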
> Unicode literals are interpreted using 'unicode-escape' which is an enhanced Latin-1 with escape semantics.
I'm sure they do "act like" Latin-1 on your box, and that identifiers also act as if Latin-1 were in effect on your box. But the Ref Man explicitly says all that is platform dependent; there's no "backward compatibility" to preserve here beyond 7-bit ASCII unless you want to preserve Python's reliance on whatever C isalpha() says.
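For reference, the "Latin-1 with escape semantics" behaviour mentioned in the quoted text can be observed with the codec of that name. This is a sketch against Python 3's "unicode_escape" codec, which is an assumption here; it is not a claim about what any particular 2002 build did, and the sample bytes are made up for illustration.

```python
# One raw 8-bit byte (0xE9) plus a spelled-out backslash-u escape.
raw = b"caf\xe9 costs 5 \\u20ac"

# The codec maps raw bytes to the code points of the same value
# (as Latin-1 does) and also expands the backslash-u escape.
text = raw.decode("unicode_escape")
high_byte_as_latin1 = text[3] == "\u00e9"
escape_expanded = text.endswith("\u20ac")

# Plain Latin-1 decoding leaves the six escape characters untouched.
plain_latin1 = raw.decode("latin-1")
```

Decoding 0xE9 this way is a property of the codec, not of the language definition, which is the distinction the reply above draws.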