[Python-Dev] PEP 263 considered faulty (for some Japanese)

Stephen J. Turnbull stephen@xemacs.org
18 Mar 2002 20:48:49 +0900

Previous message: [Python-Dev] PEP 263 considered faulty (for some Japanese)
Next message: [Python-Dev] PEP 263 considered faulty (for some Japanese)
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

"Martin" == Martin v Loewis <martin@v.loewis.de> writes:

Martin> That is simply not true. The encoding applies to the
Martin> entire source code.

Martin> It is only that it is processed just for Unicode literals,

Would you please unpack this? As it stands it looks like an oxymoron.

Martin> and this is a documented deviation of the language
Martin> implementation from the language spec.

I don't see any need for a deviation of the implementation from the spec. Just slurp in the whole file in the specified encoding. Then cast the Unicode characters in ordinary literal strings down to bytesize (my preference, probably with errors on Latin-1 <0.5 wink>) or reencode them (Guido's and your suggestion). People who don't like the results in their non-Unicode literal strings (probably few) should use hex escapes. Sure, you'll have to rewrite the parser in terms of UTF-16. But I thought that was where you were going anyway.
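To make the two options concrete, here is a minimal sketch in present-day Python 3 terms. It is not the PEP 263 implementation; the function names (decode_source, narrow_literal, reencode_literal) are hypothetical, and the use of Latin-1 as the "byte-sized" range is an assumption (passing "ascii" instead would reject Latin-1 characters as well).

```python
def decode_source(raw, encoding):
    """Slurp step: decode the whole file using its declared encoding."""
    return raw.decode(encoding)


def narrow_literal(text, charset="latin-1"):
    """Option 1: cast each character of an ordinary literal down to one byte.

    Raises UnicodeEncodeError for any character outside `charset`.
    The default charset is an assumption made for this sketch.
    """
    return text.encode(charset)


def reencode_literal(text, encoding):
    """Option 2: re-encode the literal body in the file's declared encoding."""
    return text.encode(encoding)


# Hypothetical literal body taken from a file declared as EUC-JP.
raw = "日本語".encode("euc_jp")
text = decode_source(raw, "euc_jp")

# Re-encoding gives back the octets that were in the file.
assert reencode_literal(text, "euc_jp") == raw

# Narrowing works for characters that fit in one byte ...
assert narrow_literal("café") == b"caf\xe9"

# ... and is an error for everything else.
try:
    narrow_literal(text)
except UnicodeEncodeError:
    print("cannot narrow: use hex escapes instead")
```

Under option 1 the author of the Japanese literal would be pushed toward hex escapes; under option 2 the ordinary string ends up holding the same octets as the source file.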

If not, it should be nearly trivial to rewrite the parser in terms of UTF-8 (since it is a superset of ASCII and non-ASCII is currently only allowed in comments or guarded by a (Unicode)? string literal AFAIK). The main issue would be anything that involves counting characters (not bytes!), I think. Everything else is a no-op because high-bit-set octets only occur in whole-character units and in things that could be considered single tokens (string literals and comments), so just keep glomming them on the current token until you find any of the token-ending characters in the current ASCII-based implementation. No need to change any syntax. Transforming the UTF-8 to UTF-16 for Unicode string literals is fast, easy to implement, and guaranteed invertible (modulo the UTF-32 vs UCS-4 issue).
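A rough sketch of the "keep glomming" idea, written in Python rather than in the C of the actual tokenizer. The TOKEN pattern and the helper names are hypothetical, and the pattern covers only names, integers, and single-character operators; it is meant to show that high-bit-set octets can ride along on the current token without the scanner knowing anything about UTF-8 beyond "bit 7 is set".

```python
import re

# ASCII token rules, with octets 0x80-0xFF accepted wherever a letter is.
# Every byte of a multi-byte UTF-8 character has its high bit set, so no
# ASCII token-ending character can appear inside one.
TOKEN = re.compile(rb"[A-Za-z_\x80-\xff][A-Za-z0-9_\x80-\xff]*|[0-9]+|\S")


def tokenize(src):
    """Split UTF-8 source bytes into tokens without decoding them."""
    return TOKEN.findall(src)


def char_count(utf8):
    """Count characters, not bytes: skip continuation octets (10xxxxxx)."""
    return sum(1 for b in utf8 if b & 0xC0 != 0x80)


def to_utf16(utf8):
    """Transform a UTF-8 literal body to UTF-16 (little-endian, no BOM)."""
    return utf8.decode("utf-8").encode("utf-16-le")


src = "αβ = 42".encode("utf-8")
tokens = tokenize(src)

# The Greek name survives as one token, exactly as its bytes appeared.
assert tokens == ["αβ".encode("utf-8"), b"=", b"42"]

# Column arithmetic is the part that needs care: 2 characters, 4 bytes.
assert char_count(tokens[0]) == 2
assert len(tokens[0]) == 4

# The UTF-8 to UTF-16 step round-trips for well-formed input.
assert to_utf16(tokens[0]).decode("utf-16-le").encode("utf-8") == tokens[0]
```

The only place the sketch has to look inside a character is char_count, which matches the point that counting characters is the main issue.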

The UTF-8 strategy probably gives you identifiers containing arbitrary characters reliably (that is, as reliable as anything that admits more than one encoding can be) and nearly for free, in the same way as you get arbitrary string data and comments. It's debatable whether that's a good thing, of course. (Except for the obfuscators, to whom "it's all Greek to me" will be music to their ears.)

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Don't ask how you can "do" free software business;
ask what your business can "do for" free software.
