[Python-Dev] PEP 263 considered faulty (for some Japanese)

Stephen J. Turnbull stephen@xemacs.org
12 Mar 2002 21:18:29 +0900


"Guido" == Guido van Rossum <guido@python.org> writes:

Guido> [Not having one-octet ASCII as a subset] sets UTF-16 apart
Guido> from most other encodings, in particular UTF-8, but also (I
Guido> believe) the common Japanese 8-bit encodings like Shift-JIS
Guido> and EUC-JP.

This is correct; all of the encodings commonly used in Japan have the property that one-octet ASCII is a subset (depending on how you define "subset" for modal encodings like JUNET). I've never seen UTF-16 "in the wild", but it's possible some groups do use it internally. But I would expect that Python (with its well-organized codec interface) would present a small problem compared to ordinary text editors (including both Emacsen) and other commonly used applications. As far as I know, none of the freely available recoding utilities (except GNU recode and GNU iconv, which are not tuned to the Japanese environment) support UTF-16. So it would be a very special environment.
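The subset property can be checked with a minimal sketch like the following. It assumes a modern Python 3 whose standard library ships the CJK codecs, and it uses the `iso2022_jp` codec as a stand-in for the JUNET encoding; the helper name `is_ascii_transparent` is made up for illustration.

```python
# Sketch: which codecs encode plain ASCII text as the same single octets?
# Assumption: Python 3 with the standard CJK codecs available.
SAMPLE = "abcd"
ASCII_BYTES = SAMPLE.encode("ascii")


def is_ascii_transparent(codec_name):
    """Return True if ASCII text encodes to the identical one-octet bytes."""
    return SAMPLE.encode(codec_name) == ASCII_BYTES


# iso2022_jp stands in for JUNET here (an assumption, not from the post).
CODECS = ("utf-8", "shift_jis", "euc_jp", "iso2022_jp", "utf-16")
results = {name: is_ascii_transparent(name) for name in CODECS}

for name, ok in results.items():
    print(f"{name:12} ASCII-transparent: {ok}")
```

Only UTF-16 fails the check, since it spends two octets per character (plus a byte order mark when encoding with the generic `utf-16` codec).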

Guido> "abcd" interpreted as UTF-16 is a two-character Unicode
Guido> string (and I wouldn't be surprised if it contained invalid
Guido> code points).

Fear not, they're in the middle of the CJK block. The second is invalid in Japanese, though.
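As a quick illustration of the point about "abcd", the sketch below decodes those four octets as UTF-16 in both byte orders and checks that the resulting code points fall in the CJK Unified Ideographs block (U+4E00 to U+9FFF). The variable names are illustrative only.

```python
# Sketch: interpret the ASCII octets of "abcd" as UTF-16 in both byte orders.
DATA = b"abcd"
CJK_START, CJK_END = 0x4E00, 0x9FFF  # CJK Unified Ideographs block

decoded = {}
for codec in ("utf-16-be", "utf-16-le"):
    text = DATA.decode(codec)
    decoded[codec] = [ord(ch) for ch in text]
    for cp in decoded[codec]:
        in_cjk = CJK_START <= cp <= CJK_END
        print(f"{codec}: U+{cp:04X} in CJK block: {in_cjk}")
```

Either way the result is a two-character string, and all four code points are assigned CJK ideographs rather than invalid values.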

Guido> I think I can propose a compromise though: there may be two
Guido> default encodings, one used for Python source code, and one
Guido> for data.

Why go in this direction? It's better to allow each individual stream to specify a codec to be implicitly applied, I think. Consider Emacs, for example, which allows specification of default codecs for (1) file contents, (2) names of file system objects, (3) process I/O (but not I and O and E separately, which has caused problems!), (4) console input, and (5) console output. All of those are plausible candidates for having separate defaults in Python as well.

For example, in Japan it's easy to imagine a program with local file contents defaulting to UTF-8 (for cross-system portability) needing to access the Windows 9x console and file system in Shift JIS, while process (e.g., network) I/O might be EUC-JP if the server were Unix. (Yes, I'm straining, but not much.)

But if you allow codecs for each stream, people who want to have different defaults for certain classes of stream could just derive classes which initialize the default codec appropriately.
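One way to picture per-stream codecs with per-class defaults is the sketch below. It is not an existing Python API: `CodecStream`, `ConsoleStream`, `NetworkStream`, the `default_codec` attribute, and the `encode_via` helper are all hypothetical names, built here on top of the standard `io.TextIOWrapper`.

```python
import io


class CodecStream(io.TextIOWrapper):
    """Text stream whose codec is set per stream, with a class-level default.

    Hypothetical class for illustration; not part of the standard library.
    """

    default_codec = "utf-8"  # e.g. local file contents

    def __init__(self, raw, codec=None, **kwargs):
        # An explicit codec wins; otherwise fall back to the class default.
        super().__init__(raw, encoding=codec or self.default_codec, **kwargs)


class ConsoleStream(CodecStream):
    default_codec = "shift_jis"  # e.g. a Windows 9x console


class NetworkStream(CodecStream):
    default_codec = "euc_jp"  # e.g. a Unix server on the other end


def encode_via(stream_cls, text, codec=None):
    """Write text through a stream class and return the octets produced."""
    raw = io.BytesIO()
    out = stream_cls(raw, codec=codec)
    out.write(text)
    out.flush()
    out.detach()  # leave the underlying BytesIO open
    return raw.getvalue()


# "Nihongo" (the word for the Japanese language), written with escapes so
# this source file itself stays pure ASCII.
SAMPLE = "\u65e5\u672c\u8a9e"
console_bytes = encode_via(ConsoleStream, SAMPLE)
network_bytes = encode_via(NetworkStream, SAMPLE)
override_bytes = encode_via(ConsoleStream, SAMPLE, codec="utf-8")
```

The same text comes out as three different octet sequences depending on the stream class, and any single stream can still override its class default by passing `codec` explicitly.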

--
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Don't ask how you can "do" free software business;
ask what your business can "do for" free software.
