[Python-Dev] PEP 263 - default encoding

Stephen J. Turnbull stephen@xemacs.org
16 Mar 2002 12:20:08 +0900


"Guido" == Guido van Rossum <guido@python.org> writes:

>> a. Does this really make sense for UTF-16?  It looks to me like
>> a great way to induce bugs of the form "write a unicode literal
>> containing 0x0A, then translate it to raw form by stripping the
>> u prefix."

Guido> Of course not. I don't expect anyone to put UTF-16 in their
Guido> source encoding cookie.

Mr. Suzuki's friends. People who use UTF-16 strings in other applications (e.g., Java), but are otherwise happy with English.

Guido> But should we bother making a list of encodings that
Guido> shouldn't be used?

I would say yes. People will find reasons to inflict harm on themselves if you don't.
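
To make the 0x0A concern and the "list of encodings" idea concrete, here is a minimal sketch. It assumes a modern Python 3 interpreter (which postdates this thread), and the helper name is_ascii_compatible is hypothetical, not something proposed in the PEP. It shows that a UTF-16 code unit can contain the byte 0x0A, and one rough way to tell which codecs leave ASCII bytes untouched.

```python
def is_ascii_compatible(encoding_name):
    """Return True if pure-ASCII text encodes to the same bytes as under ASCII.

    Hypothetical helper for illustration only; not part of PEP 263.
    """
    ascii_text = "".join(chr(i) for i in range(128))
    try:
        return ascii_text.encode(encoding_name) == ascii_text.encode("ascii")
    except (LookupError, UnicodeError):
        # Unknown codec, or a codec that cannot represent some ASCII character.
        return False


# U+010A in little-endian UTF-16 is the byte pair 0x0A 0x01, so a
# byte-oriented reader would see a line feed inside the literal.
utf16_bytes = "\u010a".encode("utf-16-le")
contains_newline_byte = b"\n" in utf16_bytes

# Check a few candidate source encodings.
compat = {
    name: is_ascii_compatible(name)
    for name in ("utf-8", "iso-8859-1", "koi8-r", "utf-16", "utf-16-le")
}
```

Under this check the UTF-16 variants fail while UTF-8, ISO-8859-1 and KOI8-R pass, which is one possible criterion for a "do not use" list. It is not the only one.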

>> b. No editor is likely to implement correct display to
>> distinguish between u"" and just "".

Guido> That's fine.  Given phase 2, the editor should display the
Guido> entire file using the encoding given in the cookie, despite
Guido> that phase 1 only applies the encoding to u"" literals.
Guido> The rest of the file is supposed to be ASCII, and if it
Guido> isn't, that's the user's problem.

Huh? I thought that people were regularly putting arbitrary text into ordinary strings, and that the whole purpose of this PEP was to extend that practice to Unicode.

Are you going to deprecate the practice of putting KOI8-R into ordinary strings? This means that Cyrillic users have to stop doing that, change the string to Unicode, and apply codecs on IO. They aren't going to bother in phase 1, and will have a rude surprise in phase 2. That's human nature, of course, but I don't see how it serves Python to risk it.
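
As an illustration of "change the string to Unicode, and apply codecs on IO", here is a minimal sketch. It assumes Python 3, where text and bytes are separate types (not the case when this was written), and the sample word is only an example. The source is kept ASCII-only by using \u escapes.

```python
import io

# A Cyrillic word written with escapes, so the source file stays ASCII.
text = "\u041f\u0440\u0438\u0432\u0435\u0442"

# The implicitly coded octet string a KOI8-R user would have typed directly.
koi8_bytes = text.encode("koi8-r")

# The same octets read under a different assumed encoding give different text.
misread = koi8_bytes.decode("iso-8859-1")

# Applying the codec at the IO boundary instead: the program holds Unicode,
# and the wrapper encodes on the way out.
buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding="koi8-r")
writer.write(text)
writer.flush()
written = buffer.getvalue()
```

The bytes that reach the terminal are the same either way. What changes is that the encoding is stated explicitly rather than assumed.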

>> e. This causes problems for UTF-8 transition, since people will
>> want to put arbitrary byte strings in a raw string.

Guido> I'm not sure I understand.  What do you call a raw string?
Guido> Do you mean an r"" literal?  Why would people want to use
Guido> that for arbitrary binary data?  Arbitrary binary data
Guido> should *always* be encoded using \xDD hex or \OOO octal
Guido> escapes.

By "raw" I meant non-Unicode here. Incorrect usage, my apologies. "Arbitrary" was the wrong word too; I meant non-UTF-8, e.g., iso-8859-1 0xFF. I would have no problem with requiring people to use escapes to write non-English strings. But the whole point of this PEP is to allow people to write those in their native encodings (for Unicode strings). People are going to continue to squirt implicitly coded octet-strings at their terminals (which just happen to have an appropriate font installed) and expect it to work.
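
The iso-8859-1 0xFF case can be sketched as follows. This assumes Python 3, and is_valid_utf8 is a hypothetical helper for illustration. It shows that the octet is not valid UTF-8 on its own, and that the \xDD and \OOO escape forms denote the same byte.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+00FF (y with diaeresis) is the single octet 0xFF in iso-8859-1 ...
latin1_bytes = "\u00ff".encode("iso-8859-1")

# ... but two octets in UTF-8.
utf8_bytes = "\u00ff".encode("utf-8")

# Hex and octal escapes spell the same byte without depending on the
# encoding of the source file.
same_escape = b"\xff" == b"\377"

latin1_is_utf8 = is_valid_utf8(latin1_bytes)
utf8_is_utf8 = is_valid_utf8(utf8_bytes)
```

A file whose ordinary strings contain such octets would therefore stop parsing cleanly once the whole file is decoded as UTF-8.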

AFAICT this interpretation of the PEP saves no pain, simply postpones it. Worse, people who don't understand it fully are going to believe it sanctions arbitrary encodings in string literals. I don't see how you can avoid widespread misunderstanding of that sort unless you have the parser refuse to execute the program---it may actually increase the pain when phase 2 starts.

Guido> Sounds like a YAGNI to me.

Could be. I'm sorry I can't be less fuzzy about the specific points. But then, that's the whole problem, really---we're trying to serve natural language usage which is inherently fuzzy.

I see lots of potential problems in interpretation of this PEP by the people it's intended to serve: those who are attached to some native encoding. Better to raise each now, and have the scorn it deserves heaped high, than to say "we coulda guessed this would happen" later.

If you think it's getting too abstract to be useful, I'll be quiet until I've got something more concrete. I'm hoping the discussion seems useful despite the fuzz.

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
              Don't ask how you can "do" free software business;
              ask what your business can "do for" free software.