[Python-Dev] PEP 263 considered faulty (for some Japanese)

SUZUKI Hisao suzuki@acm.org
Tue, 19 Mar 2002 22:17:47 JST


> And TextEdit cannot save as UTF-8?

It can. But doing so suffers from "mojibake".

> The primary reason why this is not supported is different, though: it would complicate the implementation significantly, at least the phase 1 implementation. If people contribute a phase 2 implementation that supports the UTF-16 BOM as a side effect, I would personally reconsider.

OK, I will write a sample implementation of "stage2" as soon as possible and put it in the public domain.

Anyway, until stage2 is implemented, you can write Japanese Python files only in EUC-JP or UTF-8 unless you hack the interpreter. Python therefore remains unsatisfactory to many Japanese users today, and will stay so until UTF-8 becomes common. We should either hurry up or keep waiting.

As for UTF-16 with BOM, any text outside Unicode literals should be translated into UTF-8 (not UTF-16). This is the only logical choice, since UTF-8 is strictly ASCII-compatible and can represent every Unicode character naturally. You would write source code in UTF-16 as follows:

s = '<characters>'
...
u = unicode(s, 'utf-8')  # not utf-16!

This suggests to me that the implementation will be somewhat like what Stephen J. Turnbull sketches...
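The translation step described above might look roughly like the sketch below. It is written in present-day Python 3 rather than the Python 2 of this thread, it is not the stage2 implementation itself, and the helper name source_to_utf8 is hypothetical. It also assumes the file is not UTF-32 (whose little-endian BOM begins with the same two bytes).

```python
import codecs

def source_to_utf8(raw: bytes) -> bytes:
    """Hypothetical helper: transcode UTF-16 (with BOM) source bytes to UTF-8."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM, picks the byte order, and drops it.
        return raw.decode("utf-16").encode("utf-8")
    # No UTF-16 BOM: leave the bytes untouched.
    return raw

# A plain (non-Unicode) string literal containing Japanese text.
utf16_source = "s = '日本語'\n".encode("utf-16")  # the encoder prepends a BOM
utf8_source = source_to_utf8(utf16_source)

# The bytes between the quotes are now UTF-8, so they must be decoded as UTF-8.
literal = utf8_source.split(b"'")[1]
print(literal.decode("utf-8"))
```

After the translation, the bytes inside the plain string literal are UTF-8, which is why the example above passes 'utf-8' and not 'utf-16' to unicode().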

N.B. One should write a binary data literal (not character data but, say, image or audio data) as follows:

b = '\x89\xAB\xCD\xEF'

The stage2 implementation will translate it into UTF-8 exactly as follows :-)

b = '\x89\xAB\xCD\xEF'

Hence there is no problem in translating a UTF-16 file into UTF-8. (At least, since a UTF-16 Python file is totally impossible for now, allowing it does not hurt anyone.)
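The point about the escaped binary literal can be checked with a short sketch, again in present-day Python 3. The function translate is a hypothetical stand-in for the UTF-16 to UTF-8 translation step, not part of any real implementation.

```python
def translate(line: str) -> bytes:
    """Hypothetical stand-in for the UTF-16 -> UTF-8 translation step."""
    utf16 = line.encode("utf-16")  # what the UTF-16 source file would hold
    return utf16.decode("utf-16").encode("utf-8")

# Raw string, so the backslashes stay in the source text as written.
binary_literal = r"b = '\x89\xAB\xCD\xEF'"

# The escapes are pure ASCII, so the UTF-8 result is byte-for-byte ASCII.
assert translate(binary_literal) == binary_literal.encode("ascii")
```

Because the escape sequences consist only of ASCII characters, the translated line is identical to what an ASCII file would contain, so the binary data they denote is unaffected.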

-- SUZUKI Hisao        >>> def fib(n): return reduce(lambda x, y:
   suzuki@acm.org      ...         (x,x[0][-1]+x[1]), [()]*n, ((0L,),1L))