[Python-Dev] PEP 263 considered faulty (for some Japanese)

SUZUKI Hisao suzuki611@oki.com
Wed, 13 Mar 2002 11:05:39 +0900


Please understand that the issue you bring up was specifically added on request of Japanese users.

Thank you. The statements in the PEPs Parade had frightened me terribly.

> The PEP says, "Just as in coercion of strings to Unicode, Python will default to the interpreter's default encoding (which is ASCII in standard Python installations) as standard encoding if no other encoding hints are given." This will free many English speakers from writing the magic comment in their scripts explicitly.

While that is true, it will in particular free Japanese users from putting encoding declarations in their files. Japanese users often declare the default encoding to be shift-jis or euc-jp. When Python source code is transmitted between Unix and Windows, tools are used to convert files between these two encodings. If there is an encoding declaration, those tools would need to change this, too, but the existing tools don't.

I should have appended to that, "And English people will distribute programs with no magic comments all over the world. Japanese users will use them."

Certainly Japanese users are also freed from writing encoding declarations, but we cannot expect such programs to be usable in countries other than Japan, given the PEP as it is.

BTW, when transmitting Python source code between Unix and Windows, we do not necessarily convert encodings. Any encoding which is strictly ASCII-compatible, such as EUC-JP or UTF-8, is usable on both Unix and Windows for a great many purposes (for example, for a small multi-threaded Web server; see http://www.python.jp/Zope/download/pypage/ ). Certainly some people convert encodings almost always, but others do not necessarily do so.
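For readers unfamiliar with the "magic comment" discussed above, here is a minimal sketch of how a declared source encoding can be read back from raw bytes. It uses the `tokenize.detect_encoding` function from the standard library of a modern Python 3 interpreter, which postdates this thread; the helper name `declared_encoding` and the sample sources are assumptions made only for illustration. Note that Python 3 falls back to UTF-8 when no declaration is present, whereas the PEP text quoted above describes an ASCII default.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding that tokenize detects for raw bytes.

    Hypothetical helper for illustration; it only wraps the standard
    library call and does not decode the source itself.
    """
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# A script carrying an explicit encoding declaration (the "magic comment").
with_comment = b"# -*- coding: euc-jp -*-\nprint('hello')\n"

# The same script without any declaration.
without_comment = b"print('hello')\n"

print(declared_encoding(with_comment))
print(declared_encoding(without_comment))
```

A file without the comment carries no hint of its own, so whatever the reader's interpreter assumes by default is what gets used. That is the portability concern raised above.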

Therefore, it was considered desirable to not use an encoding declaration if the default encoding matches the file encoding. It is well-understood that files without declared encoding will be less portable across systems.

Indeed Mr. Ishimoto said in [Python-ml-jp] that he thought it would be desirable not to use any encoding declarations, in that he would convert encodings. But he also said later in [Python-ml-jp] that he had withdrawn his suggestion.

> However, many Japanese users set the default encoding to something other than ASCII (we use multi-byte encodings for daily use, not as a luxury), and some Japanese users set it to, say, "utf-16".

I cannot believe this statement. Much of the standard library will break if you set the default encoding to utf-16; any sensible setting of the default encoding sets it to an ASCII superset (in the sense "ASCII strings have the same bytes under that encoding"). Anybody setting the default encoding to utf-16 has much bigger problems than source encodings.

Oh, then my coworker must have been lucky till today! Thank you for your advice!!
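The "ASCII superset" condition quoted above ("ASCII strings have the same bytes under that encoding") can be checked directly. The following is a rough sketch against the codecs shipped with a modern Python 3 interpreter; the helper name `is_ascii_superset` is hypothetical, and the list of encodings is only a sample.

```python
def is_ascii_superset(encoding):
    """True if all 128 ASCII characters encode to their own single bytes.

    Hypothetical helper for illustration. An unknown codec name or an
    encoding failure is treated as "not a superset".
    """
    ascii_text = "".join(chr(i) for i in range(128))
    try:
        return ascii_text.encode(encoding) == ascii_text.encode("ascii")
    except (UnicodeEncodeError, LookupError):
        return False


for name in ("ascii", "utf-8", "euc-jp", "utf-16"):
    print(name, is_ascii_superset(name))
```

UTF-16 fails this check because every ASCII character becomes two bytes (and Python's `utf-16` codec also prepends a byte order mark), which is why code that assumes byte-for-byte ASCII behavior would break under such a default.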

> It is perfectly ASCII-compatible and language-neutral. Once you commit yourself to Unicode, I think UTF-8 is an obvious choice anyway.

I certainly agree. Under the PEP, you can put the UTF-8 signature (BOM encoded as UTF-8) in all files (or ask your text editor to do that for you), and you won't need any additional encoding declaration. Windows notepad does that already if you ask it to save files as UTF-8, and I'd assume other editors will offer that feature as well.

Just one worry: it may be incompatible with '#!/usr/bin/env' used in Unix.
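The worry about '#!' can be made concrete with a small sketch. It assumes, as is typical on Unix systems, that the interpreter line is recognized only when the file's first two bytes are `#!`; whether a particular system tolerates a leading signature is not something this example can establish. The script contents and helper names are made up for illustration, and `tokenize.detect_encoding` is the modern Python 3 standard library function.

```python
import codecs
import io
import tokenize

# A made-up script with an interpreter line, with and without the
# UTF-8 signature (the BOM encoded as UTF-8: EF BB BF).
script = b"#!/usr/bin/env python\nprint('hello')\n"
script_with_bom = codecs.BOM_UTF8 + script


def starts_with_shebang(data):
    """True if the first two bytes are the '#!' marker."""
    return data[:2] == b"#!"


def detected_encoding(data):
    """Return the source encoding that tokenize detects for raw bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
    return encoding


print(starts_with_shebang(script))
print(starts_with_shebang(script_with_bom))
print(detected_encoding(script_with_bom))
```

With the signature in place, Python can identify the file as UTF-8 without any declaration, but the three signature bytes now precede `#!`, so a loader that inspects only the first two bytes would not see the interpreter line.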

In any case, choice of source encoding, under the PEP, is the user's choice. The option of making UTF-8 the standard encoding for all source files has been explicitly considered and was rejected.

I understand that making UTF-8 the standard encoding for all source files immediately is not feasible. I think we have had two options:

  1. Wait until UTF-8 is popular, and then adopt UTF-8 as the sole encoding. (That time might come within two years.)

  2. Make Python able to handle various encodings which are in use now.

And I believe that even after taking option 2, it would still be sensible to make UTF-8 the default encoding.

-- SUZUKI Hisao <suzuki@acm.org> <suzuki611@okisoft.co.jp>