[Python-Dev] PEP 263 considered faulty (for some Japanese)

Martin v. Loewis martin@v.loewis.de
19 Mar 2002 21:29:12 +0100


"SUZUKI Hisao" <suzuki@acm.org> writes:

> > And TextEdit cannot save as UTF-8?
>
> It can. But doing so suffers from "mojibake".

You mean, it won't read it back in properly? Is that because it won't auto-detect the encoding, or does it not even support opening files as UTF-8? Could it be told to write a UTF-8 signature into the file? Would that help autodetection?
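For instance, autodetection could be keyed off the signature, something along these lines (a minimal sketch; the file name and the ASCII fallback are placeholders):

    # Write a UTF-8 signature (the bytes EF BB BF) ahead of the text.
    UTF8_SIG = '\xef\xbb\xbf'
    f = open('example.txt', 'wb')
    f.write(UTF8_SIG)
    f.write(u'\u65e5\u672c\u8a9e'.encode('utf-8'))  # some Japanese text
    f.close()

    # When reading the file back, use the signature to pick the codec.
    raw = open('example.txt', 'rb').read()
    if raw.startswith(UTF8_SIG):
        text = raw[len(UTF8_SIG):].decode('utf-8')
    else:
        text = raw.decode('ascii')  # fall back to some default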

> Anyway, until stage 2 comes true, you can write Japanese Python files only in EUC-JP or UTF-8 unless you hack up the interpreter; thus Python remains unsatisfactory to many Japanese users until the day of UTF-8 arrives. We should either hurry up or keep waiting.

I expect that the localization patches circulating now will continue to apply (perhaps with minimal modifications) after stage 1 is implemented. If the patches are enhanced to do the "right thing" (i.e. properly take the declared encoding into account when determining the end of a string literal), people won't notice the difference compared to a full stage 2 implementation.
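To make the "right thing" concrete: PEP 263 has the encoding declared in one of the first two lines of the file, and a tokenizer that honours it can scan multi-byte literals correctly. A sketch of the failure mode the declaration fixes (using Shift_JIS, where this bites hardest):

    # -*- coding: shift_jis -*-
    # In Shift_JIS the second byte of a two-byte character may be 0x5C,
    # the backslash. A tokenizer that ignores the declared encoding reads
    # that byte as an escape character and misjudges where the string
    # literal ends; an encoding-aware tokenizer scans it correctly.
    s = '...'  # imagine Japanese text whose trail bytes include 0x5C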

> As for UTF-16 with a BOM, any text outside Unicode literals should be translated into UTF-8 (not UTF-16). This is the only logical choice, since UTF-8 is strictly ASCII-compatible and can map all of Unicode's characters naturally.

Well, no. If UTF-16 is supported as an input encoding in stage 2, it will follow the same process as any other input encoding: the byte string literals will be converted back to UTF-16. Any patch that special-cases UTF-16 will be rejected.
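Conceptually, the stage 2 compiler would do no more than this for a plain string literal (a hypothetical sketch, not actual interpreter code):

    def compile_plain_literal(decoded_literal, source_encoding):
        # Stage 2 decodes the whole source file using the declared
        # encoding; a plain (byte) string literal is then encoded back
        # to that same encoding -- whatever it is, UTF-16 included.
        return decoded_literal.encode(source_encoding)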

> You would write source code in UTF-16 as follows:
>
>     s = ''
>     ...
>     u = unicode(s, 'utf-8')  # not utf-16!

No, that won't work. Instead, you should write

    u = u''

No need to call a function.
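The point is that the compiler itself decodes a Unicode literal using the declared encoding, so u'' works no matter how the file is saved, while the detour through a byte string hard-codes one encoding. An illustrative contrast (the encode() call stands in for a literal typed into the source file):

    u1 = u'\u65e5\u672c\u8a9e'                 # decoded by the compiler

    s = u'\u65e5\u672c\u8a9e'.encode('utf-8')  # stands in for a plain
                                               # literal in a UTF-8 file
    u2 = unicode(s, 'utf-8')                   # breaks if the file is
                                               # re-saved in UTF-16
    assert u1 == u2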

> N.B. one should write a binary data literal (not character data, but, say, image or audio data) as follows:
>
>     b = '\x89\xAB\xCD\xEF'

I completely agree. Binary data should use hex escapes. That will make an interesting challenge for any stage 2 implementation, BTW: \x89 shall denote byte 0x89 no matter what the input encoding was. So you cannot convert \x89 to a Unicode character and expect conversion to the input encoding to do the right thing. Instead, you must apply the conversion to the source encoding only for the unescaped characters.
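Refining the earlier sketch: the conversion step has to skip escapes entirely, applying the source encoding only to the unescaped runs (the function name and the parts representation are invented for illustration):

    def compile_plain_literal(parts, source_encoding):
        # 'parts' alternates unescaped text runs (unicode strings) with
        # escape byte values (ints); only the former pass through the
        # declared source encoding.
        out = []
        for part in parts:
            if isinstance(part, int):
                out.append(chr(part))   # \x89 contributes byte 0x89, always
            else:
                out.append(part.encode(source_encoding))
        return ''.join(out)

    # The escape yields the raw byte even in a UTF-16 source file:
    assert compile_plain_literal([0x89], 'utf-16') == '\x89'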

People have been proposing to introduce b'' strings for binary data, to allow switching 'plain' strings to denote Unicode strings at some point, but that is a different PEP.

Regards, Martin