[Python-Dev] PEP 263 considered faulty (for some Japanese)

SUZUKI Hisao suzuki@acm.org
Sat, 16 Mar 2002 16:25:05 JST


In message <m34rjj12l7.fsf@mira.informatik.hu-berlin.de>

> > And almost every operating system in Japan is on the way to
> > adopt Unicode to save us from the chaos. I am afraid the
> > mileage of the PEP will be fairly short and just results in
> > loading a lot of burden onto the language,

> That is a mis-perception. The PEP does not add a lot of burden
> onto the language; the stage-1 implementation is fairly trivial.
> The biggest change in the interpreter will be to have the parser
> operate on Unicode all the time; that change will be necessary
> for stage 2, whether UTF-8 will be the default encoding or not.

I see. After all, the Java compiler performs such a task now.

But I wonder about the codecs for the various encodings of the various countries. Mainland China, Taiwan, Korea, and Japan each have their own encoding(s), and each needs its own large table(s) covering truly many characters. Will this not make the interpreter a huge one? And once UTF becomes dominant, will they not become a huge legacy?

Maybe each set of local codecs should be packed into a so-called Country Specific Package, which could be optional in the Python distribution. I believe you have considered such a thing already. In addition, I see that this problem does not relate to PEP 263 itself in the strict sense. The PEP just makes use of whatever codecs happen to be there, only requiring that their names match those used by Emacs, doesn't it?
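As an aside, one can see how this plays out in a modern Python (long after this message), where codecs.lookup resolves many spellings, including the Emacs-style names, through its alias table; the particular names below are just examples:

```python
import codecs

# Each familiar spelling resolves to a registered codec; the
# canonical name may differ from the alias used to look it up
# (e.g. "iso-2022-jp" resolves to the codec named "iso2022_jp").
for name in ("utf-8", "shift_jis", "iso-2022-jp", "euc-jp"):
    print(name, "->", codecs.lookup(name).name)
```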

> Also, the warning framework added will help people migrating -
> whether towards UTF-8 or custom locally-declared encodings is
> their choice.

As for declared encodings, I have one thing to say. And this is another point where THE CURRENT PEP IS FAULTY FOR SOME JAPANESE. (It relates to UTF-16 again. Sigh.)

In short:

If the current PEP recognizes the UTF-8 BOM, why does it not also allow UTF-16 with a BOM? The implementation would be very trivial, and UTF-16 with a BOM is becoming somewhat popular among casual users in Japan.

In long:

It is true that many Japanese developers do not use UTF-16 at all (and may even be suspicious of anyone who talks about using it ;-). However, the rest of us certainly do use UTF-16 sometimes. You can edit UTF-16 files with, say, jEdit (www.jedit.org) on many platforms, including Unix and Windows. And in particular, you can use TextEdit on the Mac; TextEdit on Mac OS X is the counterpart of Notepad and WordPad on Windows.

UTF-16 is typically about 2/3 the size of UTF-8 when many CJK characters are used (each of them takes 3 bytes in UTF-8 but 2 bytes in UTF-16). And on the Japanese Mac in particular, it has better support than other plain-text encodings. In the default setting, TextEdit saves a plain-text file in Shift_JIS or UTF-16. Shift_JIS suffers from the lack of several important characters which are used in real life (most notably, several Kanji used in some surnames... Yes, there is a fair motivation among casual Japanese users to use Unicode!). Once a non-Shift_JIS character is used in the file, it will be saved in UTF-16 by default (not UTF-8, regrettably; perhaps partly because of the "mojibake" problem).
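The size claim can be checked directly in a modern Python (anachronistic for this 2002 message, but the arithmetic is the same):

```python
# Each of these three Japanese characters takes 3 bytes in UTF-8
# but only 2 bytes in UTF-16, hence the 2/3 figure.
text = "日本語"  # "Japanese language", three CJK characters
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # BOM-less, for a fair per-character count
print(len(utf8), len(utf16))  # 9 bytes vs. 6 bytes
```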

Now the iBook, iMac, and PowerMac are fairly popular among casual users in Japan; they are almost always within the top 10 in PC sales. Since Mac OS X has recently become the default, casual users are quite likely to end up writing their scripts in UTF-16 with TextEdit.

Since TextEdit has key bindings similar to Emacs, even power users may want to use it to edit their scripts. Indeed, I do so.

By the way, I had reported another problem which may make PEP 263 faulty, as you know. There was a project in which one had to operate on certain fixed-length texts in utf-16-be; in the Java world such data are not so rare. That was where that encoding was used as the default. But I now see it would be reasonable not to depend on a default in such cases. Anyway, one could say that was a special case...

But this is not so. UTF-16 files are becoming popular among not a few Mac users in Japan. The easy availability of various characters which are not found in classic JIS but are in Unicode does attract some Japanese. (Look into several Mac magazines in Japan, if you can.)

As the programming language for everyone, it would be very nice for Python to accept such scripts. I believe orthogonality has also been one of the most important virtues of Python.

The implementation would be fairly straightforward: if the file begins with either 0xFE 0xFF or 0xFF 0xFE, it must be UTF-16.
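A minimal sketch of that check in today's Python (the function name and return convention are mine, not from the PEP):

```python
def sniff_utf16(raw: bytes):
    """Guess a UTF-16 codec from a leading BOM; return None if absent."""
    if raw.startswith(b"\xfe\xff"):   # big-endian BOM
        return "utf-16-be"
    if raw.startswith(b"\xff\xfe"):   # little-endian BOM
        return "utf-16-le"
    return None

# A file saved by, e.g., TextEdit as UTF-16 begins with one of the BOMs:
print(sniff_utf16(b"\xff\xfe" + "print(1)".encode("utf-16-le")))
```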

> > I know it is not the best practice either. However, you cannot
> > safely write ShiftJIS into Python source file anyway, unless
> > you hack up the interpreter parser itself for now.

> In stage 2 of the PEP, this will be possible (assuming Python has
> access to a ShiftJIS codec).

Yes, I have appreciated the PEP for this very possibility. We will also be able to use even ISO-2022-JP.
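What stage 2 promises can be sketched with today's Python, where compile() already honors a PEP 263 coding cookie in byte-string source (the Shift_JIS example below is mine):

```python
# Source bytes that declare their own encoding, as PEP 263 specifies.
# The parser decodes them with the declared codec before compiling.
source = "# -*- coding: shift_jis -*-\nword = '日本'\n".encode("shift_jis")

ns = {}
exec(compile(source, "<example>", "exec"), ns)
print(ns["word"])  # the two Kanji survive the round trip
```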

If stage 2 comes within a year and is highly stable, it may be EXTREMELY useful in Japan. Otherwise, I am afraid it might turn out to be a disservice... (Maybe only the UTF/Unicode codecs will be relevant by then...)

-- SUZUKI Hisao <suzuki@acm.org> "Bye bye, Be!"