[Python-Dev] PEP 263 considered faulty (for some Japanese)

SUZUKI Hisao suzuki611@oki.com
Tue, 12 Mar 2002 19:57:35 +0900


Thank you for reading my message.

Is your objection specifically focused on UTF-16? As far as I understand, UTF-16 is (mostly) a two-byte encoding, that is not a superset of ASCII (i.e. the 8-bit string "abcd", when interpreted using UTF-16, does not mean the same thing as the Unicode string u"abcd"). This sets UTF-16 apart from most other encodings, in particular UTF-8, but also (I believe) the common Japanese 8-bit encodings like Shift-JIS and EUC-JP.

Yes, UTF-16 is a two-byte encoding and not a superset of ASCII. We use ISO-2022-JP, EUC-JP, UTF-8 and Shift_JIS where ASCII-compatibility is more or less needed (for example, in e-mail messages and program source code).

In addition, we often have to handle Japanese documents written in UTF-16. They are produced sometimes by Java programs, and sometimes by text editors. Some of us currently use Unicode mainly for them.
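The ASCII-compatibility point can be checked directly. The following is a minimal sketch in present-day Python 3 syntax (not the Python 2.2 under discussion), using only standard codecs; the sample string is arbitrary.

```python
# Which encodings leave pure-ASCII text byte-for-byte unchanged?
ascii_bytes = b"abcd"

compat = {}
for name in ("utf-8", "shift_jis", "euc_jp", "iso2022_jp", "utf-16-be"):
    compat[name] = "abcd".encode(name) == ascii_bytes

# The same four bytes read as UTF-16 form two unrelated 16-bit code units.
as_utf16 = ascii_bytes.decode("utf-16-be")
```

Only the UTF-16 entry in `compat` comes out false, and `as_utf16` is a two-character string rather than "abcd".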

You write "set the default encoding". There are many ways to set a default encoding. Python has a very specific way to set its default encoding: the only way is to edit the site.py library module. Is this what you are referring to?

Yes.

I would think that setting Python's default encoding to UTF-16 in this way is a bad idea, because it breaks the main purpose of the default encoding: to allow an automatic coercion from the 8-bit strings that are used in many places in Python programs to Unicode strings. [...] For this reason, I find it hard to believe that people really set the Python default encoding in site.py to "utf-16". Maybe I'm wrong -- or maybe you're talking about a different default encoding?

What we handle in Unicode with Python is often a document file in UTF-16. The default encoding is mainly applied to data from the document. Certainly we use EUC-JP etc. in Python scripts, but mostly in comments and the like.

Setting the default to UTF-16 is often a handy way to handle Unicode for the present.

It sounds like these people never rely on the implicit conversion between 8-bit strings and Unicode as I showed above, but instead use explicit conversions from data read from Unicode files, omitting the encoding. So maybe you really do mean what I fear (setting Python's default encoding to UTF-16 in site.py).

Yes, I mean such things. Please note that u'' is interpreted just literally and we cannot put Japanese characters in string literals legally for now anyway.

Python 2.2 (#1, Jan 16 2002, 12:05:05)
[GCC egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf_16_be'
>>> u'abc'
u'abc'
>>> unicode("\x00a\x00b\x00c")
u'abc'
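For readers without a Python 2 interpreter: the last call in the session works because unicode() with no encoding argument falls back to the default encoding. A rough Python 3 equivalent is sketched below. It has to name the codec explicitly, since Python 3 has no configurable default of this kind.

```python
data = b"\x00a\x00b\x00c"          # "abc" as big-endian UTF-16, no BOM

text = data.decode("utf_16_be")     # what unicode(data) did in the session

# An ordinary ASCII byte string is not valid input under the same codec:
try:
    b"abc".decode("utf_16_be")      # odd length, so the last code unit is cut off
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

The failing second decode is the implicit 8-bit-to-Unicode coercion that a UTF-16 default gives up.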

> I would propose that Python should default to ASCII as standard encoding if no other encoding hints are given, as the bottom line. The interpreter's default encoding should not be referred for source code.

Unfortunately, this doesn't work for people in Europe, who set Latin-1 as the default encoding, and want to use Latin-1 in their source files.

Nor does it work for some others of us in Japan, who set EUC-JP as the default encoding, and want to use EUC-JP in their source files.

I think I can propose a compromise though: there may be two default encodings, one used for Python source code, and one for data. Normally, a single default encoding is used for both cases, but it is possible to specify two different defaults, and then persons who like UTF-16 can set ASCII as the default source encoding but UTF-16 as the default data encoding.

It sounds very nice. I understand that the default data encoding will be applied to data read from file objects. It must be the only(?) satisfactory solution if the default source encoding is to be set in site.py.

Or else we should give up the default encoding for data...
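The two-defaults idea could look roughly like the sketch below, written in present-day Python 3 syntax. SOURCE_ENCODING, DATA_ENCODING and read_data are hypothetical names made up for illustration; nothing of the sort exists in Python, and this is not the mechanism proposed in the thread, only one way to picture it.

```python
import os
import tempfile

# Hypothetical names, for illustration only; neither exists in Python.
SOURCE_ENCODING = "ascii"      # what the parser would assume for .py files
DATA_ENCODING = "utf-16"       # what file objects would assume for documents

def read_data(path, encoding=DATA_ENCODING):
    """Read a document using the data default unless told otherwise."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()

# Round trip through a temporary UTF-16 document.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w", encoding=DATA_ENCODING) as f:
        f.write("日本語 abc")
    document = read_data(path)
    with open(path, "rb") as f:
        raw_head = f.read(2)   # byte-order mark written by the codec
finally:
    os.remove(path)
```

The data default only affects how file contents are decoded; the source default stays a separate setting and never sees the UTF-16 bytes.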

> And I hope that Python defaults to UTF-8 as standard encoding if no other encoding hints are given. It is ASCII-compatible perfectly and language-neutral. If you once commit yourself to Unicode, I think, UTF-8 is an obvious choice anyway.

I'm not sure I understand. (I understand UTF-8 perfectly well. :-) In the previous paragraph you propose to default to ASCII. In this paragraph you propose to default to UTF-8. Now which do you want? Or do you want to propose these two for different situations?

I'm sorry for the ambiguity. I proposed ASCII as the minimum request; what I hope for is UTF-8.

Note that I originally wanted to use UTF-8 as the default encoding, but was convinced otherwise by the Europeans who rarely use UTF-8 but often Latin-1: but rather than giving anyone preferential treatment (except for the Americans, whose keyboards don't have keys that generate non-ASCII characters :-), I decided that the only fair solution was to default to ASCII, which has the property that any non-ASCII characters are considered an error. But of course, the option to edit site.py sort of defeats this purpose. :-)

ASCII can express, I believe, only English and classical Latin well. It would be safe to say that it is unfair to all people in the world except English-speaking people.

Once committed to Unicode, and if ASCII-compatibility is mandatory, UTF-8, which is language-neutral, seems to be the only fair solution to everyone.

Of course, it might not be so if committed to ISO-2022...
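To make the fairness argument concrete, here is a small sketch (Python 3 syntax, arbitrary sample strings) that checks which of three candidate defaults can represent English, Western European and Japanese text at all.

```python
samples = ["print 'hello'", "café", "日本語"]

def encodable(text, encoding):
    """Return True if text can be represented in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# For each sample: (fits in ASCII, fits in Latin-1, fits in UTF-8)
table = {s: (encodable(s, "ascii"), encodable(s, "latin-1"), encodable(s, "utf-8"))
         for s in samples}
```

ASCII accepts only the first sample, Latin-1 the first two, and UTF-8 all three.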

> From my experiences, inserting the '-*- coding: -*-' line into an existing file and converting such a file into UTF-8 are almost the same amount of work.

Yes, for those people who have a UTF-8 toolchain set up. I expect that many Europeans don't have one handy, because their needs are met by Latin-1.

Writing a converter from Latin-1 to UTF-8 is an easy exercise in Python programming. For a UTF-8 editor, IDLE on Tcl/Tk 8.3 may be handy.
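Such a converter might look like the following minimal sketch, in present-day Python 3 syntax. The function name latin1_to_utf8 is made up for illustration.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8."""
    # Every byte value is valid Latin-1, so the decode step cannot fail.
    return data.decode("latin-1").encode("utf-8")

converted = latin1_to_utf8(b"caf\xe9")   # 0xE9 is e-acute in Latin-1
```

To convert a file, read it in binary mode, pass the bytes through the function, and write the result back in binary mode. Pure-ASCII input comes out unchanged.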

Those who want to use Latin-1 in the source code can always specify '-*- coding: latin-1 -*-'.
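As an illustration of how such a declaration is read, the sketch below uses tokenize.detect_encoding from the Python 3 standard library. That function postdates this thread, and the UTF-8 fallback it reports is Python 3 behaviour, not something decided here. The helper name declared_encoding is made up.

```python
import io
import tokenize

def declared_encoding(source: bytes) -> str:
    """Return the encoding Python 3 would use to read this source."""
    enc, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    return enc

with_cookie = declared_encoding(b"# -*- coding: latin-1 -*-\nname = 'x'\n")
without_cookie = declared_encoding(b"name = 'x'\n")
```

The declared name is normalised, so 'latin-1' is reported as 'iso-8859-1'.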

> We will be glad if Python understands Japanese (and other) characters by default (by adopting, say, UTF-8 as default).

I think that in the future, we will be able to change the default to UTF-8. Picking ASCII as the "official" default has the advantage that it will let us switch to UTF-8 in the future, when we feel that there is enough support for UTF-8 in the average computer system.

If one does not have enough support for UTF-8, and has some 8-bit clean editor (which is mandatory for Latin-1), I think, UTF-8 is effectively the same as ASCII -- one can program entirely in ASCII and cannot put national characters directly.

Others may distribute programs in UTF-8, but they will reasonably restrict their use of national characters (say, to their signatures) if they want their programs to be usable all over the world. National characters may be displayed as some meta characters in the editor.

Of course, this may not be the case if the program depends deeply upon the local culture... (One example comes to my mind: I Ching -- it is pan-East-Asian)

-- SUZUKI Hisao <suzuki@acm.org> <suzuki611@okisoft.co.jp>