[Python-3000] Support for PEP 3131

Stephen J. Turnbull stephen at xemacs.org
Sat May 26 09:42:57 CEST 2007


Jim Jewett writes:

How about a regexp character class as starting point?

I'm not sure I understand. Do you mean that part of localization should be defining what certain regular expressions should match?

No, I meant simply a list of character ranges, as characters. The definition of "safe ASCII" would be something like

r"\t\r\n -~"

Your table format is better. If people want to put the actual characters in comments (maybe in source files to be preprocessed before installation), let them.
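For concreteness, here is a minimal sketch of how the "list of character ranges" spelling could be used as a regexp character class. The names `SAFE_ASCII_CLASS` and `is_safe_ascii` are made up for illustration; nothing in the thread or the PEP defines them.

```python
import re

# The "safe ASCII" ranges quoted above: tab, CR, LF, and the
# printable range from space through tilde.
SAFE_ASCII_CLASS = r"\t\r\n -~"

# Wrapping the ranges in [...] turns the list into a character class.
SAFE_ASCII_RE = re.compile("[" + SAFE_ASCII_CLASS + "]*")

def is_safe_ascii(text):
    """Return True if every character of text falls in the listed ranges."""
    return SAFE_ASCII_RE.fullmatch(text) is not None
```

A table of code point ranges carries the same information; the string form is just shorter to type.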

So long as we allow tailoring, I think the maximal set should be generous -- and I don't see any reason to pre-exclude anything outside ASCII.

Cf characters? Are we admitting "stupid bidi tricks", too?

But I'll tell you what my reason is: we want to be in a position to avoid prohibiting previously acceptable characters wherever possible.
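On the Cf question: a rough sketch of how a tool could at least flag format characters (general category Cf, which includes the bidi overrides) in an identifier, using the standard `unicodedata` module. The function name `format_chars` is hypothetical, and this only reports the characters; whether to reject them is the policy question under discussion.

```python
import unicodedata

def format_chars(identifier):
    """Return (index, code point) pairs for every Cf character found."""
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(identifier)
        if unicodedata.category(ch) == "Cf"
    ]
```

For example, an identifier containing U+202E (RIGHT-TO-LEFT OVERRIDE) would be reported with the position of that character.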

There are people who like to use names like "Program Files" or "Summary of Results.Apr-3-2007 version 2.xls"; I expect the same will be true of identifiers. So long as the punctuation is not ASCII, we might as well let them.

Why not let them use ASCII punctuation, as long as it's not Python syntax?

I.e., for one thing, we might want to do something with that punctuation some day. For example, I could imagine using guillemets to denote raw strings or to substitute for triple quotes. Local parsing (as done by program editors) would be easier with directed quotes. Etc. For reasons of visual distinctiveness, we might choose to use Chinese or Arabic versions.

The other committees say to exclude certain scripts, like Linear B and Ogham. And not to allow mixed scripts, at least if they're confusable. But I really don't want to explain why someone using Cyrillic can't use certain (apparently to him) randomly determined identifiers just because it could be confused with ASCII (or Armenian).

-1 on restrictions according to confusability or the block. That's a matter for personal judgement, and there are cheap technical solutions for those who want to use confusable Cyrillic or Linear B and still avoid confusion. I think those restrictions are an option that must be available (perhaps as a table we distribute), but I think they'll turn out to suck pretty badly.
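As one illustration of what a cheap, opt-in check might look like (an assumption on my part, not something specified in this thread), a lint-style helper can report identifiers that mix scripts. The standard library does not expose the Unicode Script property, so this sketch approximates it with the first word of each letter's character name, which is a crude heuristic. The names `scripts_used` and `is_mixed_script` are hypothetical.

```python
import unicodedata

def scripts_used(identifier):
    """Approximate the set of scripts used by the letters of identifier.

    Heuristic: take the first word of the Unicode character name
    (e.g. "LATIN", "CYRILLIC"). This is NOT the real Script property.
    """
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

def is_mixed_script(identifier):
    """Return True if the letters appear to come from more than one script."""
    return len(scripts_used(identifier)) > 1
```

A check like this stays out of the language definition: those who care run it, and nobody is told their identifier is illegal.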

If Unicode comes out with a new revision, the new characters should probably be allowed; I don't want a situation where users of Cham or Lepcha[1] are told they have to wait another year because their scripts weren't formally adopted into Unicode until after Python 3.4.0 was already released.

Tough call. I'd say, let's cross that bridge when we come to it.

In any case there will have to be some mechanism to access a Unicode database at either build time or run time. Let them munge that database if they're in a hurry.

Maybe the way to handle this is to allow private-space characters in identifiers as an option. That would be doable with your well-known file scheme. But it's very dangerous across modules.

By the way, this is what the Japanese call the "gaiji" ("outside character") problem. It's a very tough nut to crack; the Japanese never did.
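To make the database dependence and the private-space option a bit more concrete, here is a small sketch using the standard `unicodedata` module. It shows which Unicode version the running interpreter's database implements and how private-use characters (general category Co) could be picked out of an identifier. The names `UNICODE_DB_VERSION` and `private_use_chars` are made up for illustration.

```python
import unicodedata

# Version of the Unicode database this interpreter was built with.
# Characters added in later revisions are unassigned as far as it knows.
UNICODE_DB_VERSION = unicodedata.unidata_version

def private_use_chars(identifier):
    """Return the private-use (category Co) characters in identifier."""
    return [ch for ch in identifier if unicodedata.category(ch) == "Co"]
```

The cross-module danger is visible here: the database can say that a character is private-use, but not what any particular module means by it.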


