[Python-Dev] PEP 263 considered faulty (for some Japanese)

SUZUKI Hisao suzuki611@oki.com
Thu, 14 Mar 2002 15:10:05 +0900


> SUZUKI> I should have appended to that, "And English people will
> SUZUKI> distribute programs with no magic comments all over the
> SUZUKI> world. Japanese users will use them."
>
> But this "just works" as long as the default encoding is an ASCII
> superset (or even JIS X 0201 (^^; as Japanese users are now all
> equipped with YEN SIGN <-> REVERSE SOLIDUS codecs).

Yes, this is the problem I found.
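Concretely, the swap mentioned above is between the two readings of
code point 0x5C: in JIS X 0201 it is YEN SIGN, while in ASCII it is
REVERSE SOLIDUS. A minimal sketch of such a translation pair (the
function names are my own invention, not any existing codec):

    def to_ascii_repertoire(s):
        # YEN SIGN -> REVERSE SOLIDUS
        return s.replace(u"\u00a5", u"\u005c")

    def to_jis_repertoire(s):
        # REVERSE SOLIDUS -> YEN SIGN
        return s.replace(u"\u005c", u"\u00a5")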

> SUZUKI> Certainly Japanese users are also free from putting
> SUZUKI> encoding declarations, but we do not expect such programs
> SUZUKI> to be usable in other countries than Japan, given the PEP
> SUZUKI> as is.
>
> But this is also true for everyone else, except Americans. All of
> the common non-ASCII encodings are non-universal and therefore
> non-portable, with the exception of UTF-8 and X Compound Text (and
> the latter is a non-starter in program sources because of the 0x22
> problem).

Indeed. If we are to distribute Python programs to various countries, I think we must write them in UTF-8 anyway. Under the PEP as it stands, the magic comment or BOM is mandatory unless every character code happens to be less than 0x80. This is tedious and somewhat ugly, but not fatal.
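For reference, the cookie sanctioned by the PEP looks like this (a
minimal sketch; the file itself is saved as UTF-8):

    # -*- coding: utf-8 -*-
    # Under PEP 263 this cookie (or a UTF-8 BOM) must appear on the
    # first or second line whenever any byte in the file is >= 0x80.
    greeting = "こんにちは"   # a non-ASCII literal, now legal to ship
    print(greeting)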

> I myself objected to this PEP because I think it's far too easy for
> my Croatian (Latin-2) friend working in Germany to paste a Latin-1
> quote into a Latin-2 file. He'll do it anyway on occasion, but if
> we start insisting now that "Python programs are written in UTF-8",
> we'll avoid a lot of mojibake. 12 years in Japan makes that seem an
> important goal. But such multiscript processing is surely a lot
> more rare in any country but Japan.

I agree with you on the problem of "mojibake". UTF-8 is, at present, the sole encoding in which people all over Asia, Europe, or indeed the world can cooperate safely on the same Python source file.
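A single byte suffices to show the hazard. For instance (a minimal
sketch, using the codec names and syntax of a recent Python for
illustration):

    # The same byte reads as different letters in Latin-1 and Latin-2.
    b = b"\xe6"
    print(b.decode("latin-1"))     # 'æ' in Latin-1
    print(b.decode("iso8859-2"))   # 'ć' in Latin-2 -- a Croatian letter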

The PEP will serve to make the various local encodings of the present day "official". It will not do much to save us from the chaos of local encodings.

And almost every operating system in Japan is on its way to adopting Unicode to escape that chaos. I am afraid the mileage of the PEP will be fairly short, and it will just load a lot of burden onto the language, though that is not fatal in itself.

> SUZUKI> BTW, when transmitting Python source code between Unix and
> SUZUKI> Windows, we do not necessarily convert encodings.
>
> But this is bad practice. You can do it if it works for you, but
> Python should not avoid useful changes because people are treating
> different encodings as the same!

I know it is not the best practice either. However, you cannot safely write Shift_JIS into a Python source file anyway, unless you hack up the interpreter's parser itself. Strictly speaking, Shift_JIS is not compatible with ASCII: the second byte of a double-byte character may be 0x5C, which the parser takes for a backslash. With the present Python as is, you are only safe writing EUC-JP or UTF-8 in source.
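To make the 0x5C problem concrete (a sketch, assuming a Python
equipped with Japanese codecs, e.g. the JapaneseCodecs package or a
later Python that ships them):

    # The character U+8868 (HYOU) is the classic example.
    text = u"\u8868"
    print(text.encode("shift_jis"))   # b'\x95\\'   -- the second byte
                                      # is 0x5C, an ASCII backslash
    print(text.encode("euc_jp"))      # b'\xc9\xbd' -- both bytes are
                                      # >= 0x80, so EUC-JP stays clear
                                      # of the ASCII range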

On a very serious project, it is reasonable to use the original (i.e., not hacked) interpreter and either EUC-JP or UTF-8, both on Unix and Windows, even in the "present day, present time".

> There's a third option:
>
> 3. Make UTF-8 the only encoding acceptable for "standard Python",
>    and insert a hook for a codec to be automatically run on source
>    text. Standard Python would never put anything on this hook, but
>    an optional library would provide other codecs, including one to
>    implement PEP 263. Guido thought the idea has merit, as an
>    implementation. Therefore UTF-8 would be encouraged by Python,
>    but PEP 263 would give official sanction to the -*- coding: xxx
>    -*- cookie. And this would give you a lot of flexibility for
>    experimentation (e.g., with UTF-16 codecs, etc.).

Certainly this would not load a burden onto the language itself, even if the mileage of the PEP turns out to be short.
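To illustrate what I understand by the hook (every name below is my
own invention, not anything that exists in Python): standard Python
would decode source as UTF-8 only, and an optional library could
install a decoder to run first on the raw bytes, e.g. a cookie reader
for PEP 263.

    import re

    _source_decoder = None  # standard Python never installs anything

    def set_source_decoder(decoder):
        """Install a callable taking raw source bytes, returning text."""
        global _source_decoder
        _source_decoder = decoder

    def decode_source(raw):
        if _source_decoder is not None:
            return _source_decoder(raw)   # e.g. the PEP 263 reader below
        return raw.decode("utf-8")        # the only standard behaviour

    def pep263_decoder(raw):
        # Look for the cookie on the first or second line, with a
        # simplified form of the pattern given in PEP 263.
        for line in raw.split(b"\n", 2)[:2]:
            m = re.search(rb"coding[:=]\s*([-\w.]+)", line)
            if m:
                return raw.decode(m.group(1).decode("ascii"))
        return raw.decode("utf-8")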

-- 
SUZUKI Hisao <suzuki@acm.org> <suzuki611@okisoft.co.jp>