[Python-Dev] PEP 263 considered faulty (for some Japanese)

Paul Prescod paul@prescod.net
Mon, 18 Mar 2002 07:53:10 -0800


"Stephen J. Turnbull" wrote:

... I don't see any need for a deviation of the implementation from the spec. Just slurp in the whole file in the specified encoding.

That's phase 2. It's harder to implement so it won't be in Python 2.3. They are trying to get away with changing the output of the lexer/parser rather than the input because the lexer/parser code probably predates Unicode and certainly predates Guido's thinking about internationalization issues. We're moving in baby steps.
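(For context, the mechanism under discussion is PEP 263's encoding declaration, a magic comment on the first or second line of a source file. A minimal sketch, shown here in modern Python for concreteness:)

```python
# -*- coding: utf-8 -*-
# The comment above is the PEP 263 encoding declaration: it tells the
# tokenizer to decode this source file as UTF-8, so non-ASCII text is
# legal in comments and in (Unicode) string literals.
s = u"日本語"  # a Unicode literal containing three Japanese characters
print(len(s))  # three characters
```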

... Then cast the Unicode characters in ordinary literal strings down to bytesize (my preference, probably with errors on Latin-1 <0.5 wink>) or reencode them (Guido's and your suggestion). People who don't like the results in their non-Unicode literal strings (probably few) should use hex escapes. Sure, you'll have to rewrite the parser in terms of UTF-16. But I thought that was where you were going anyway.

Sure, but a partial implementation now is better than a perfect implementation at some unspecified time in the future.

If not, it should be nearly trivial to rewrite the parser in terms of UTF-8 (since it is a superset of ASCII and non-ASCII is currently only allowed in comments or guarded by a (Unicode)? string literal AFAIK). The main issue would be anything that involves counting characters (not bytes!), I think.

That would be an issue. Plus it would be the first place that the Python interpreter used UTF-8 as an internal representation. So it would also be a half-step, but it might involve more redoing later.
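The character-counting worry is concrete: for non-ASCII text, UTF-8 byte counts and character counts diverge, so any parser bookkeeping done in bytes (column numbers, error offsets) would be off. A small illustration, again in modern Python terms:

```python
# Each of these three Japanese characters occupies three bytes in UTF-8,
# so byte offsets and character offsets disagree for non-ASCII source.
text = "日本語"
encoded = text.encode("utf-8")
print(len(text), len(encoded))  # 3 characters vs. 9 bytes
```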

Paul Prescod