[Python-3000] PEP 3131 accepted

Stephen J. Turnbull stephen at xemacs.org
Wed May 23 13:07:57 CEST 2007


Josiah Carlson writes:

> From identical character glyph issues (which have been discussed off and on for at least a year),

In my experience, this is not a show-stopping problem. Emacs/MULE has had it for 20 years because of the (horrible) design decision to attach charset information to each character in the representation of text. Thus, MULE distinguishes between NO-BREAK SPACE and NO-BREAK SPACE (the same character!) depending on whether the containing text "is" ISO 8859-15 or "is" ISO 8859-1. (Semantically this is different from the identical-glyph, different-character problem, since according to ISO 8859 those characters are identical. As a practical matter, however, the problem of detecting and dealing with the situation is the same, because in MULE the character codes are different.)

How does Emacs deal with this? Simple. We provide facilities to identify identical characters (not relevant to PEP 3131, probably), to highlight suspicious characters (proposed, not actually implemented AFAIK, since identification does what almost all users want), and to provide information on characters in the editing buffer. The remaining problems with coding confusion are due to deficient implementation (mea maxima culpa).
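That last facility is easy to approximate outside Emacs, too. For illustration, here is a rough Python sketch (mine, purely illustrative, not the Emacs implementation) that reports the code point and Unicode name of every non-ASCII character in a file, which is most of what a reader needs to spot a lookalike:

    import sys
    import unicodedata

    def report_non_ascii(path):
        # Print the position, code point, and Unicode name of every
        # non-ASCII character, so a reader can spot lookalikes.
        with open(path, encoding="utf-8") as f:
            for lineno, line in enumerate(f, start=1):
                for col, ch in enumerate(line, start=1):
                    if ord(ch) > 127:
                        name = unicodedata.name(ch, "<unnamed>")
                        print("%s:%d:%d: U+%04X %s"
                              % (path, lineno, col, ord(ch), name))

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            report_non_ascii(path)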

I consider this to be an editor/presentation problem, not a language definition issue.

Note that Ka-Ping's worry about the infinite extensibility of Unicode relative to any human being's capacity is technically not a problem. You simply have your editor substitute machine-generated identifiers for each identifier that contains characters outside of the user's preferred set (e.g., using hex codes to restrict to ASCII), then review the code. When you discover what an identifier's semantics are, you give it a mnemonic name according to the local style guide. Expensive, yes. But cost is a management problem, not the kind of conceptual problem Ka-Ping claims is presented by multilingual identifiers. Python is still, in this sense, a finitely generated language.
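To make that concrete, here is a rough sketch of such a substitution pass using Python's tokenize module; the "id_" hex-naming scheme is just one arbitrary choice, not a proposal:

    import io
    import tokenize

    def asciify(name):
        # Illustrative scheme: encode a non-ASCII identifier as the
        # hex code points of its characters, prefixed so it remains
        # a valid (and recognizable) ASCII identifier.
        if all(ord(c) < 128 for c in name):
            return name
        return "id_" + "_".join("%04x" % ord(c) for c in name)

    def asciify_source(source):
        result = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME:
                result.append((tok.type, asciify(tok.string)))
            else:
                result.append((tok.type, tok.string))
        # Two-tuples put untokenize in compatibility mode, so renamed
        # tokens need not fit the original column positions.
        return tokenize.untokenize(result)

    print(asciify_source("\u03c0 = 3.14159\nprint(\u03c0)\n"))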

> to editing issues (being that I write and maintain a Python editor)

Multilingual editing (except for non-LTR scripts) is pretty much a solved problem, in theory, although adding it to any given implementation can be painful. However, since there are many programmer's editors that can handle multilingual text already, that is not a strong argument against PEP 3131.

> Yes, PEP 3131 makes writing software in Python easier for some, but for others, it makes maintenance of 3rd party code a potential nightmare (regardless of 'community standards' to use ASCII identifiers).

Yes, there are lots of nightmares. In over 15 years of experience with multilingual identifiers, I can't recall any that have lasted past the break of dawn, though.

I just don't see such identifiers very often, and when I do, they are never hard to deal with. Admittedly, I don't ever need to deal with Arabic or Devanagari or Thai, but I'd be willing to bet I could deal with identifiers in those languages, as long as the syntax is ASCII.

As for third-party code, "the doctor says that if you put down that hammer, your head will stop hurting". If multilingual third-party code looks like a maintenance risk, don't deal with that third party.[1] Or budget for translation up front; translators are quite a bit cheaper than programmers.

BTW, "find . -name '*.py' | xargs grep -l '[^[:ascii:]]'" is a pretty cheap litmus test for your software vendors! And yes, it should be looking into strings and comments. In practice (once I acquired a multilingual editor), handling non-English strings and comments has been 99% of the headache of maintaining code that contains non-ASCII.
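Not every grep understands the [:ascii:] character class, but the same litmus test is a few lines of Python; a rough equivalent of the pipeline above:

    import os

    # List every .py file under the current directory that contains
    # bytes outside the ASCII range (including strings and comments).
    for dirpath, _, filenames in os.walk("."):
        for fn in filenames:
            if fn.endswith(".py"):
                path = os.path.join(dirpath, fn)
                with open(path, "rb") as f:
                    if any(b > 127 for b in f.read()):
                        print(path)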

I've been maintaining the edict.el library, an interface to Jim Breen's Japanese-English dictionary EDICT for XEmacs, for 10 years (there was serious development activity for only about the first 2, though). A large fraction of the identifiers specific to that library contain Japanese characters (both ideographic kanji and syllabic kana), plus the pseudo-namespace prefix "edict-" in ASCII. There are several Japanese identifiers in there whose meaning I still don't know, except by referring to the code to see what it does (they're technical terms in Japanese linguistics, I believe, and probably about as intelligible to the layman as terms in Dutch tax law). At the time I started maintaining that library, I did so because I couldn't read Japanese (obviously!).

This turned out to pose no problem. Japanese identifiers were not visually distinct to me, but when I needed to analyze a function, I became familiar with the glyphs of related identifiers quickly. And having an intelligible name to start with wouldn't have helped much; I needed to analyze the function because it wasn't doing what I wanted it to do, not because I couldn't translate the name.

There are other packages in XEmacs which use non-ASCII, non-English identifiers, but they are rare. Maintaining them has never been reported as a problem.

N.B. This is limited experience with what many might characterize as a niche language, and I'm an idiosyncratic individual, blessed with a reasonable amount of talent for language learning. Both are valid points.

However, I think the killer point in the above is the one about strings and comments. If you can discipline your team to write comments and strings in ASCII/English, extending that to identifiers is no problem. If your team insists on multilingual strings/comments, or needs them due to the task, multilingual identifiers will be the least of your problems, and the most susceptible to technical solution (e.g., via identification and quarantine by cross-reference tables).
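By "cross-reference tables" I mean something like the following rough sketch, which collects every non-ASCII identifier in a set of Python files together with the places it occurs, so a maintainer can review them all in one place:

    import sys
    import tokenize

    def cross_reference(paths):
        # Map each non-ASCII identifier to the (file, line) pairs
        # where it appears.
        xref = {}
        for path in paths:
            with tokenize.open(path) as f:  # honors coding cookies
                for tok in tokenize.generate_tokens(f.readline):
                    if (tok.type == tokenize.NAME
                            and any(ord(c) > 127 for c in tok.string)):
                        xref.setdefault(tok.string, []).append(
                            (path, tok.start[0]))
        return xref

    if __name__ == "__main__":
        for name, sites in sorted(cross_reference(sys.argv[1:]).items()):
            print(name + ":",
                  ", ".join("%s:%d" % site for site in sites))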

Granted, this is going to be a more or less costly transition for ASCII-only Pythonistas. I think we should focus on cost-reduction, not on why it shouldn't happen.

Footnotes:

[1] Yes, I know, in the real world sometimes you have to. Multilingual identifiers are the least of your worries when dealing with a monopoly supplier.


